Microsoft's NextLat introduces a training objective that rewards belief-state representations instead of relying solely on next-token prediction, pushing models toward compact world models for better generalization.
A comprehensive free guide explaining LLMs from first principles, covering tokens, transformers, attention, fine-tuning, and local deployment.
The third edition of the Speech and Language Processing textbook by Jurafsky and Martin was released in January 2026, featuring a clear explanation of Transformers and various updates including new chapters on ASR, TTS, and DPO.
A tweet highlights the Transformer architecture chapter from Jurafsky and Martin's textbook, praising its clear and mathematically grounded explanation of self-attention, multi-head attention, and related mechanisms.
Proposes Distance-Adaptive Representation (DAR), which reduces key-value dimensionality for distant tokens while preserving full dimensionality for nearby tokens, improving KV cache efficiency without performance loss.
Noam Shazeer, a key researcher behind transformers and MoE, is joining OpenAI as head of architecture research, moving from Google.
This blog post introduces a benchmark methodology for evaluating how well open models perform on agentic coding tasks, focusing not just on accuracy but on the efficiency of the agent's process. It provides a customizable tooling harness using the pi coding agent and tests across models and library revisions.
Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.
This paper provides a theoretical analysis of deep transformers' ability to model hierarchical structures using bounded-depth context-free grammars, constructing explicit positional-attention transformers that encode grammatical states in linearly separable subspaces.
MorphStrata introduces a layer-specific stochastic noise injection strategy for generating diverse student models in a Moving Target Defense framework to enhance adversarial robustness in time-series forecasting, achieving up to 97.97% improvement in RMSE under BIM attacks with minimal training overhead.
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.
The paper reveals that latent reasoning in transformer-based reasoning models (TRMs) functions as a policy improvement operator, and proposes an algorithm that enhances learning and inference efficiency by up to 18x.
This paper introduces RecurrReason, a difficulty-controlled benchmark of four symbolic logic puzzles to evaluate multi-step reasoning in sequence models. Fine-tuning experiments on T5 and GPT-2 show that architecture determines success more than scale, and that pre-training transfer depends on local transition structure.
This paper trains a two-layer transformer encoder to classify rational elliptic curves by rank from Frobenius traces, achieving >99% accuracy. Mechanistic interpretability reveals the model learns the Mestre-Nagao heuristic and concentrates attention on prime positions, demonstrating that transformers can learn number-theoretic algorithms.
This study investigates machine learning models for predicting exam outcomes from physiological data such as electrodermal activity, heart rate, and skin temperature, finding that both deep learning approaches and simpler models like random forests can be effective.
LoopCoder-v2 proposes Parallel Loop Transformers (PLT) for efficient test-time computation scaling in code generation, showing that two loops yield significant gains while additional loops bring diminishing returns and incur positional mismatch costs.
Proposes an hourglass-shaped transformer with nonuniform width allocation that outperforms uniform-width baselines in language modeling while reducing FLOPs and KV cache size.
This thread argues that standard transformers have a topological flaw: once a state representation reaches the top layer, they cannot update beliefs over time, causing collapse as depth increases.
A reflection on the broad implications of transformer architectures beyond LLMs, including potential impacts on linguistics, genetics, and causal modeling, comparing their significance to the Haber-Bosch process.
This paper presents an end-to-end hybrid framework for rumour detection in low-resource Algerian dialect social media content, achieving an F1-score of 0.84 by combining transformer embeddings with a classical classifier.