Tag
A tweet explaining the core formula of the attention mechanism in transformer models: Q × Kᵀ computes relevance scores, softmax converts those scores to probabilities, and V delivers the content. Together these form the foundation of modern AI.
LingBot-Map is an open-source, real-time streaming 3D reconstruction model that uses a single camera, running at ~20 FPS via a feed-forward geometric context transformer, outperforming both streaming and offline methods.
The article discusses the arbitrariness of AI model creation, proposing to draw inspiration from physics models, build a repository of candidate models, and formalize the model selection process.
A developer built a transformer model entirely from scratch in TypeScript, including a custom autograd engine, and released it as an open-source educational tool on GitHub.
A Zhihu contributor's half-year-old prediction that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning is gaining relevance as Loop Engineering advances. The article explores how future Transformer architectures may evolve into hybrid models blending linear-complexity layers for background context with attention for precise reasoning, plus finer-grained sparsity and native System 2 reasoning.
This paper introduces the Proper Scoring Ensemble Filter (PSEF), a transformer-based method for Bayesian filtering that trains an analysis map using strictly proper scoring rules on synthetic state-observation trajectories. It demonstrates superior performance in nonlinear, non-Gaussian filtering tasks compared to classical and learning-based methods.
This paper proposes using sparse autoencoders to detect out-of-distribution inputs for transformers, including typos and jailbreak prompts, by analyzing spurious concept activations. The method enables a mechanistically grounded fine-tuning strategy to improve LLM robustness.
Proposes TempoWave, a plug-and-play temporal wavelet digit interface that maps time series observations into digit-wise embeddings built from multi-wavelet coefficients, improving LLM-based time series forecasting and achieving state-of-the-art results on multiple benchmarks.
PMDformer introduces patch-mean decoupling and specialized attention mechanisms to improve shape similarity modeling in long-term time series forecasting, outperforming existing methods on multiple benchmarks.
Jayden Teoh proposes Next-Latent Prediction (NextLat), a self-supervised learning method that trains the Transformer to predict its next hidden state, thereby forming a compact world model for reasoning and planning. It achieves up to 3.3x inference speedup through self-speculative decoding.
Explores the growing memory bottleneck of KV-cache in transformer inference, explaining why alternative architectures with fixed-size memory like Mamba and RWKV are gaining renewed attention.
This paper fine-tunes PEGASUS on the XL-Sum English corpus, achieving state-of-the-art results with significant ROUGE improvements over the baseline mT5 model.
Proposes the Multi-Stream Fraud Transformer (MSFT) for financial fraud detection, which independently encodes transaction, login, and risk event streams using Transformers and fuses them with time-aware positional encoding and gated fusion, achieving 0.9961 AUROC on a large dataset.
This paper investigates why accumulated token-dependent orthogonal transformations, such as those used in PaTH Attention and a simplified variant with SO(2) rotations, enable length extrapolation in transformers. It proves that such transformations become incoherent after a finite number of steps, suppressing attention to distant tokens, and shows both theoretically and experimentally that this mechanism improves extrapolation, although performance eventually degrades at extreme context lengths.
This paper introduces LDM-v0, a large decision model trained offline on trajectories from thousands of diverse reinforcement learning environments, demonstrating that a single transformer policy can match the performance of task-specific policies across robotics, autonomous driving, inventory management, cybersecurity, trading, and video games.
Introduces the ICML 2026 paper Functional Attention, which treats functions as first-class citizens and replaces softmax point-to-point similarity with structured linear operators. The method addresses discretization, resolution sensitivity, and high computational complexity in traditional Transformers when handling continuous functions. It matches or surpasses SOTA on tasks such as PDE solving and 3D segmentation, and exhibits strong OOD generalization.
Christopher Manning is spotlighted as a keynote speaker at the AGI Summit, with emphasis on his pioneering research in NLP, including GloVe and the attention mechanism, as well as his role at Stanford.
Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.
A tweet introducing 'The Transformer Cookbook', a paper that provides a beautiful introduction to hardcoding algorithms (addition, lookup, branching) inside transformer weights, following the RASP paper.
Proposes a knowledge-guided two-stage transfer learning framework using a lightweight GPT-2-style Transformer for cross-domain bearing fault diagnosis with limited data, achieving 92.61% accuracy with only 10% labeled data.