Tag
This paper theoretically characterizes the minimax risk of KV cache compression in transformers, providing design principles for accurate compression under causal masking, and instantiates them in a practical algorithm with promising results on LongBench.
Presents a closed-form reduced-order model of GRPO training dynamics, reducing it to a damped oscillator and deriving predictions for stability, group-size invariance, and loss curvature. Validated across multiple models and benchmarks.
This paper theoretically analyzes how curriculum learning, by decomposing complex problems into simpler sub-problems and composing solutions, can dramatically reduce the sample complexity of learning to simulate sequential computations (semiautomata) compared to direct methods, achieving subpolynomial supervision requirements in supervised fine-tuning and exponentially weaker coverage conditions in reinforcement learning with verifiable rewards.
This paper derives a scaling law for sketched linear contrastive learning under a Gaussian latent-variable model, analyzing how risk decomposes into approximation, optimization, and statistical terms, and provides theoretical guidance for balancing model size, data, and compute in contrastive learning.
This paper provides a theoretical analysis of deep transformers' ability to model hierarchical structures using bounded-depth context-free grammars, constructing explicit positional-attention transformers that encode grammatical states in linearly separable subspaces.
This paper extends empirical findings that the Mahalanobis cosine similarity (MCS) between linear probes linearly predicts out-of-distribution AUROC, and proves this relationship theoretically under Gaussian assumptions.
This paper demonstrates that two-layer neural networks trained with gradient-based methods can achieve the optimal computational-statistical tradeoff for learning Gaussian single-index models, matching the SQ lower bound up to polylogarithmic factors for all generative exponents and extending to sparse settings with a novel weight perturbation technique.
This paper introduces finite certificates for verifying determinacy and emergence in language model in-context behavior, providing theoretical criteria and experimental validation on contemporary models.
This paper establishes a theoretical framework showing that smooth activations in deep neural networks can mitigate the curse of dimensionality in uniform convergence, providing non-asymptotic guarantees and outperforming ReLU networks in worst-case reliability.
This paper investigates how class label encoding influences neural collapse in neural network classifiers, showing that with one-hot encoding and balanced data, uncentered mean features transition from a simplex equiangular tight frame to an orthogonal frame as bias regularization increases.
This theoretical paper analyzes the expressivity of padded transformers, showing that attention type, width, and uniformity have little impact compared to numeric precision and model depth. It establishes equivalences between transformer variants and circuit complexity classes like AC0 and TC0, providing a robust characterization.
This paper proposes a strategic robustness objective for learning simulators in model-based reinforcement learning, formulated as a minimax game between a model player and an adversarial policy player. Theoretical guarantees and a provably convergent algorithm are provided, with experiments showing reduced prediction error and improved real-world policy transfer.
This paper identifies a spectral phenomenon called Stability of Singular Distribution (SoSD) in large language model pre-training, where the singular value spectrum stabilizes early while parameters continue to evolve. The authors prove that this stabilization marks the transition to the slow-descent phase of training, and they analyze how training strategies like WSD and Muon affect this behavior.
This paper derives batch scaling laws for sketched linear regression under power-law spectra, analyzing one-pass and multi-pass mini-batch SGD. It provides explicit risk decompositions showing how batch size affects bias, variance, and fluctuation terms, and establishes that without-replacement sampling yields lower noise than with-replacement.
This paper theoretically characterizes the representational capacity of Neural Process (NP) architectures, proving a strict hierarchy among Conditional, Attentive, Convolutional, and Transformer NPs, and showing that finite-dimensional latent variables do not expand representational capacity beyond the encoder.
This paper formalizes reasoning redundancy in LLMs as the fraction of trailing steps that can be truncated without affecting correctness, quantifying 61-93% redundancy across frontier models and proving that redundancy is a structural consequence of length-agnostic outcome rewards.
This paper develops a systematic framework for establishing universality of machine learning models that handle inputs of varying dimensions (e.g., graphs with different node counts). It shows that many existing architectures fail to be universal and proposes simple modifications to restore universality.
This paper investigates the 'small-vs-large gap', where training on fewer samples with more repetitions can lead to faster learning and compute savings compared to using larger datasets, attributing the speedup to layer-wise growth enabled by sampling biases. The findings suggest that smaller datasets with repetition can be proactively leveraged as favorable inductive biases, particularly in reasoning tasks.
This paper proposes Lossless Anti-Distillation Sampling (LADS), a novel sampling scheme that counters multi-account distillation by correlating responses across accounts while preserving exact statistical fidelity for individual benign users. Theoretical analysis and experiments show LADS degrades distilled student performance on image, math, and code generation.
This paper analyzes the global distributional behavior induced by iterative masked-token resampling in masked language models using Glauber dynamics. It introduces a rectangle test for incompatibility, establishes mixing time bounds, and empirically demonstrates phase transitions and metastable semantic basins.