Tag
Flexformer proposes a flexible linear Transformer with fully learnable attention kernels using random Fourier features, achieving linear complexity while matching or exceeding softmax attention performance on language modeling and sequence classification tasks.
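As a rough illustration of why kernelized attention scales linearly with sequence length, the sketch below computes attention through a feature map so that the key-value summary is built once and reused for every query. It is a generic linear-attention sketch, not Flexformer's method: the fixed `elu(x) + 1` feature map is an assumption standing in for the learnable random-Fourier-feature kernels the summary describes, and the function names are hypothetical.

```python
import math

def feature_map(x):
    # elu(x) + 1: a simple positive feature map.
    # Assumption for illustration; NOT the paper's learnable kernel.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(Q, K, V):
    """Kernelized attention in O(n) over sequence length n.

    Q, K, V are lists of row vectors (lists of floats).
    """
    d_v = len(V[0])
    phi_K = [feature_map(k) for k in K]
    r = len(phi_K[0])

    # One pass over the keys: S = sum_j phi(k_j) v_j^T, z = sum_j phi(k_j).
    S = [[0.0] * d_v for _ in range(r)]
    z = [0.0] * r
    for pk, v in zip(phi_K, V):
        for a in range(r):
            z[a] += pk[a]
            for b in range(d_v):
                S[a][b] += pk[a] * v[b]

    # Each query reuses S and z, so no n x n score matrix is ever formed.
    out = []
    for q in Q:
        pq = feature_map(q)
        denom = sum(pq[a] * z[a] for a in range(r))
        out.append([sum(pq[a] * S[a][b] for a in range(r)) / denom
                    for b in range(d_v)])
    return out
```

The result equals normalizing the pairwise kernel scores `phi(q) · phi(k)` over all keys; only the order of the multiplications changes, which is what removes the quadratic cost.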
MultiHashFormer is a hash-based generative language model that represents each token as a unique hash signature, enabling parameter-efficient autoregression. It outperforms standard Transformer LMs at 100M, 1B, and 3B scales and supports multilingual vocabulary expansion without increasing parameters.
This paper presents a multi-stage explainable framework that combines SHAP-based token attribution, theory-informed linguistic features, and reasoning from LLaMA-3.1-70B-Instruct to interpret transformer-based speech models for cognitive impairment detection, achieving strong clinical alignment and high usability scores.
This paper systematically studies how temporal metadata can be structurally embedded into named entity recognition (NER) models for historical texts. Experiments with absolute and relative temporal representations injected via early or late fusion mechanisms show that late fusion strategies yield more robust performance on French and German historical datasets.
Introduces LPES, a layer-specific positional embedding scaling method that mitigates the 'lost-in-the-middle' problem in LLMs by assigning distinct scaling factors per layer using a genetic algorithm with Bézier curves, achieving up to 11.2% accuracy gain without fine-tuning or latency increase.
The Prism Transformer replaces uniform multi-head attention with a progressive head schedule that increases head count across layers, enabling a local-to-global hierarchy without extra parameters or FLOPs. It consistently outperforms standard Transformers on language modeling and zero-shot benchmarks at 124M, 354M, and 757M scales.
The paper introduces the context-ready transformer, a recurrent architecture that pre-contextualizes tokens before the transformer block, achieving significant inference speedups (e.g., 1.7x on A100) while matching or exceeding standard transformer performance with fewer layers.
Stanford's CS336 course on language modeling from scratch is announced, featuring intensive hands-on assignments covering tokenizers, transformers, data, and alignment.
NanoEuler is a GPT-2-scale language model built entirely from scratch in C/CUDA without any ML libraries, including hand-written forward/backward passes, a byte-level BPE tokenizer, and training pipeline. The project is an educational artifact demonstrating the engineering behind transformer training and runs on a single RTX 4070.
An interactive web page that visualizes a tiny transformer with editable weights, allowing users to see how changes affect predictions in real time, aimed at helping developers understand the forward pass of an LLM.
A tweet explaining the core formula of the attention mechanism in transformer models: Q × Kᵀ computes relevance, Softmax converts to probabilities, and V delivers content, forming the foundation of modern AI.
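The three steps the tweet names can be written out as a minimal pure-Python sketch. The division by the square root of the key dimension is not mentioned in the summary above; it is included here because the standard scaled dot-product formulation uses it. The function names and the toy inputs are illustrative assumptions.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on small list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Step 1: Q x K^T gives the relevance of this query to every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # Step 2: softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # Step 3: the weights mix the value vectors into the output.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two identical keys receive equal weight, so the output is the mean of V.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
# [[2.0, 3.0]]
```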
LingBot-Map is an open-source, real-time streaming 3D reconstruction model that uses a single camera, running at ~20 FPS via a feed-forward geometric context transformer, outperforming both streaming and offline methods.
The article discusses the arbitrariness of AI model creation, proposing to draw inspiration from physics models, build a repository of candidate models, and formalize the model selection process.
A developer built a transformer model entirely from scratch in TypeScript, including a custom autograd engine, and released it as an open-source educational tool on GitHub.
A Zhihu contributor's half-year-old prediction that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning is gaining relevance as Loop Engineering advances. The article explores how future Transformer architectures may evolve into hybrid models blending linear-complexity layers for background context with attention for precise reasoning, plus finer-grained sparsity and native System 2 reasoning.
This paper introduces the Proper Scoring Ensemble Filter (PSEF), a transformer-based method for Bayesian filtering that trains an analysis map using strictly proper scoring rules on synthetic state-observation trajectories. It demonstrates superior performance in nonlinear, non-Gaussian filtering tasks compared to classical and learning-based methods.
This paper proposes using sparse autoencoders to detect out-of-distribution inputs for transformers, including typos and jailbreak prompts, by analyzing spurious concept activations. The method enables a mechanistically grounded fine-tuning strategy to improve LLM robustness.
Proposes TempoWave, a plug-and-play temporal wavelet digit interface that maps time series observations into digit-wise embeddings from multi-wavelet coefficients, improving LLM-based time series forecasting and achieving state-of-the-art on multiple benchmarks.
PMDformer introduces patch-mean decoupling and specialized attention mechanisms to improve shape similarity modeling in long-term time series forecasting, outperforming existing methods on multiple benchmarks.
Jayden Teoh proposes Next-Latent Prediction (NextLat), a self-supervised learning method that trains the Transformer to predict its next hidden state, thereby forming a compact world model for reasoning and planning. It also achieves up to 3.3x inference speedup through self-speculative decoding.