Tag
Explores the growing memory bottleneck of the KV cache in transformer inference, explaining why alternative architectures with fixed-size memory, such as Mamba and RWKV, are gaining renewed attention.
This paper fine-tunes PEGASUS on the XL-Sum English corpus, reporting state-of-the-art results and significant ROUGE improvements over the baseline mT5 model.
Proposes the Multi-Stream Fraud Transformer (MSFT) for financial fraud detection, which independently encodes transaction, login, and risk event streams using Transformers and fuses them with time-aware positional encoding and gated fusion, achieving 0.9961 AUROC on a large dataset.
This paper investigates why accumulated token-dependent orthogonal transformations, such as those used in PaTH Attention and a simplified variant with SO(2) rotations, enable length extrapolation in transformers. It proves that such transformations become incoherent after a finite number of steps, suppressing attention to distant tokens, and shows both theoretically and experimentally that this mechanism improves extrapolation but eventually degrades at extreme context lengths.
This paper introduces LDM-v0, a large decision model trained offline on trajectories from thousands of diverse reinforcement learning environments, demonstrating that a single transformer policy can match the performance of task-specific policies across robotics, autonomous driving, inventory management, cybersecurity, trading, and video games.
Introduces the ICML 2026 paper Functional Attention, which treats functions as first-class citizens and replaces softmax point-to-point similarity with structured linear operators. It addresses the discretization, resolution sensitivity, and high computational complexity that traditional Transformers face when handling continuous functions. It matches or surpasses SOTA on tasks such as PDE solving and 3D segmentation, and exhibits strong OOD generalization.
Christopher Manning is spotlighted as a keynote speaker at the AGI Summit, highlighting his pioneering research in NLP, including GloVe and work on attention mechanisms, as well as his role at Stanford.
Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.
A tweet introducing 'The Transformer Cookbook', a paper that provides a beautiful introduction to hardcoding algorithms (addition, lookup, branching) inside transformer weights, following the RASP paper.
Proposes a knowledge-guided two-stage transfer learning framework using a lightweight GPT-2-style Transformer for cross-domain bearing fault diagnosis with limited data, achieving 92.61% accuracy with only 10% labeled data.
This paper proposes H-Res, a method to adapt large transformer models by shaping the energy landscape of associative memories without modifying weights or adding prompts, preserving memory capacity and outperforming LoRA.
A benchmark study comparing traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer variants (DistilBERT, TinyBERT, MobileBERT) for on-device fault detection across three public datasets. Traditional ML offers competitive accuracy at far smaller resource footprints, while TinyBERT-4L is the most deployment-friendly transformer.
NeuroSonic introduces a conditional flow-matching framework for reconstructing continuous speech from EEG signals, addressing the structural mismatch between neural and acoustic data by learning a deterministic probability-flow velocity field. It achieves up to 26.3% improvement in perceptual quality over existing GAN, diffusion, and mean-flow baselines on cross-subject benchmarks.
SurfBind, a surface-centric learning framework for epitope prediction, uses a Transformer-based architecture with patch-level surface modeling and binder-aware cross-attention to achieve state-of-the-art performance on epitope identification benchmarks.
Introduces AutoSpecNER, an expert-annotated dataset for fine-grained named entity recognition in vehicle listings, with 659 advertisements annotated across 15 entity types. Benchmark results show DeBERTa achieves 90% micro-F1, outperforming rule-based and LLM approaches.
A comprehensive survey of transformer-based language models covering architectures, applications across domain verticals (healthcare, finance, legal, etc.), and a critical assessment of trade-offs including compute cost, alignment, and data provenance.
SURGeLLM introduces a unified transformer framework with surgical feature gates, task-conditioned prefix tokens, and instance-weighted normalization to address mismatched inductive biases, class imbalance, and lexical knowledge injection in multi-task learning, achieving significant gains across four diverse NLP tasks.
Nvidia has quietly acqui-hired the team from Essential AI, including Transformer paper coauthor Ashish Vaswani, who had been struggling to raise funds for the startup. Vaswani will work on Nvidia's Nemotron open-source models.
This paper introduces a new approach leveraging certainty in transformer models, building on the 'Attention Is All You Need' paradigm.
Release of free workshop recordings and materials (23 videos, 250 slides, 50 exercises) for building your own LLM from fundamentals to transformer architecture, with no math or ML prerequisites.