theoretical-analysis

#theoretical-analysis

The risk of KV cache compression

arXiv cs.LG · 2d ago

This paper theoretically characterizes the minimax risk of KV cache compression in transformers, providing design principles for accurate compression under causal masking, and instantiates them in a practical algorithm with promising results on LongBench.


Predictable GRPO: A Closed-Form Model of Training Dynamics

arXiv cs.LG · 4d ago

This paper presents a closed-form reduced-order model of GRPO training dynamics that reduces them to a damped oscillator and derives predictions for stability, group-size invariance, and loss curvature. The model is validated across multiple models and benchmarks.


Learning to Reason with Curriculum II: Compositional Generalization

arXiv cs.LG · 6d ago

This paper theoretically analyzes how curriculum learning, which decomposes complex problems into simpler sub-problems and composes their solutions, can dramatically reduce the sample complexity of learning to simulate sequential computations (semiautomata) compared to direct methods. The curriculum approach achieves subpolynomial supervision requirements in supervised fine-tuning and exponentially weaker coverage conditions in reinforcement learning with verifiable rewards.


Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling

arXiv cs.LG · 2026-06-26

This paper derives a scaling law for sketched linear contrastive learning under a Gaussian latent-variable model, analyzing how risk decomposes into approximation, optimization, and statistical terms, and provides theoretical guidance for balancing model size, data, and compute in contrastive learning.


An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

arXiv cs.CL · 2026-06-17

This paper provides a theoretical analysis of deep transformers' ability to model hierarchical structures using bounded-depth context-free grammars, constructing explicit positional-attention transformers that encode grammatical states in linearly separable subspaces.


Comparing Linear Probes with Mahalanobis Cosine Similarity

Hugging Face Daily Papers · 2026-06-17

This paper extends empirical findings that the Mahalanobis cosine similarity (MCS) between linear probes linearly predicts out-of-distribution AUROC, and proves this relationship theoretically under Gaussian assumptions.
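
As a rough illustration of the quantity named in this summary, the sketch below computes a cosine similarity under a generalized inner product `<a, b> = a^T M b`. The function name `mahalanobis_cosine`, the matrix `M`, and the example vectors are assumptions for illustration; the summary does not say whether the paper uses a covariance matrix or its inverse as the metric, so `M` here is simply any symmetric positive-definite matrix.

```python
import math

def mahalanobis_cosine(u, v, M):
    """Cosine similarity of u and v under the inner product <a, b> = a^T M b.

    M is assumed to be symmetric positive-definite (hypothetical metric;
    the paper's exact choice is not given in the summary).
    """
    def inner(a, b):
        return sum(a[i] * M[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return inner(u, v) / math.sqrt(inner(u, u) * inner(v, v))

# Two hypothetical probe directions in a 2-D feature space.
w1, w2 = [1.0, 0.0], [1.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to ordinary cosine similarity
stretched = [[4.0, 0.0], [0.0, 1.0]]  # weights the first coordinate more

print(mahalanobis_cosine(w1, w2, identity))
print(mahalanobis_cosine(w1, w2, stretched))
```

With the identity metric this is the ordinary cosine similarity; a non-identity metric changes how much each feature direction contributes to the comparison.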


Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

arXiv cs.LG · 2026-06-16

This paper demonstrates that two-layer neural networks trained with gradient-based methods can achieve the optimal computational-statistical tradeoff for learning Gaussian single-index models, matching the SQ lower bound up to polylogarithmic factors for all generative exponents and extending to sparse settings with a novel weight perturbation technique.


Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models

arXiv cs.LG · 2026-06-09

This paper introduces finite certificates for verifying determinacy and emergence in language model in-context behavior, providing theoretical criteria and experimental validation on contemporary models.


Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations

arXiv cs.LG · 2026-06-05

This paper establishes a theoretical framework showing that smooth activations in deep neural networks can mitigate the curse of dimensionality in uniform convergence, providing non-asymptotic guarantees and outperforming ReLU networks in worst-case reliability.


The role of class encoding in neural collapse

arXiv cs.LG · 2026-06-02

This paper investigates how class label encoding influences neural collapse in neural network classifiers, showing that with one-hot encoding and balanced data, uncentered mean features transition from a simplex equiangular tight frame to an orthogonal frame as bias regularization increases.
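
For readers unfamiliar with the two geometries named in this summary, the following sketch builds both endpoints using the standard textbook construction of a simplex equiangular tight frame. It does not reproduce the paper's analysis of bias regularization; the helper names `simplex_etf`, `orthogonal_frame`, and `gram` are illustrative assumptions.

```python
import math

def simplex_etf(k):
    """Rows are k unit vectors with pairwise inner product -1/(k-1)."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def orthogonal_frame(k):
    """Rows are k unit vectors with pairwise inner product 0."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

def gram(rows):
    """Matrix of all pairwise inner products between the rows."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

k = 4
etf_gram = gram(simplex_etf(k))
orth_gram = gram(orthogonal_frame(k))
print(etf_gram[0][0], etf_gram[0][1])    # unit norm, negative off-diagonal
print(orth_gram[0][0], orth_gram[0][1])  # unit norm, zero off-diagonal
```

The only difference between the two frames is the off-diagonal entry of the Gram matrix: `-1/(k-1)` for the simplex frame versus `0` for the orthogonal one.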


Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

arXiv cs.LG · 2026-06-01

This theoretical paper analyzes the expressivity of padded transformers, showing that attention type, width, and uniformity have little impact compared to numeric precision and model depth. It establishes equivalences between transformer variants and circuit complexity classes like AC0 and TC0, providing a robust characterization.


Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

arXiv cs.LG · 2026-05-29

This paper proposes a strategic robustness objective for learning simulators in model-based reinforcement learning, formulated as a minimax game between a model player and an adversarial policy player. Theoretical guarantees and a provably convergent algorithm are provided, with experiments showing reduced prediction error and improved real-world policy transfer.


The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

arXiv cs.LG · 2026-05-27

This paper identifies a spectral phenomenon called Stability of Singular Distribution (SoSD) in large language model pre-training, where the singular value spectrum stabilizes early while parameters continue to evolve. The authors prove that this stabilization marks the transition to the slow-descent phase of training, and they analyze how training strategies like WSD and Muon affect this behavior.


From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

arXiv cs.LG · 2026-05-26

This paper derives batch scaling laws for sketched linear regression under power-law spectra, analyzing one-pass and multi-pass mini-batch SGD. It provides explicit risk decompositions showing how batch size affects bias, variance, and fluctuation terms, and establishes that without-replacement sampling yields lower noise than with-replacement.


Characterizing the Representational Capacity of Neural Processes

arXiv cs.LG · 2026-05-26

This paper theoretically characterizes the representational capacity of Neural Process (NP) architectures, proving a strict hierarchy among Conditional, Attentive, Convolutional, and Transformer NPs, and showing that finite-dimensional latent variables do not expand representational capacity beyond the encoder.


How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

arXiv cs.AI · 2026-05-26

This paper formalizes reasoning redundancy in LLMs as the fraction of trailing steps that can be truncated without affecting correctness, measuring 61–93% redundancy across frontier models and proving that redundancy is a structural consequence of length-agnostic outcome rewards.
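
The definition in this summary can be made concrete with a minimal sketch, assuming a reasoning trace is a list of steps and correctness can be checked on any prefix. The function `redundancy`, the checker `is_correct`, and the toy trace are hypothetical; the paper's actual step segmentation and evaluation procedure are not described here.

```python
def redundancy(steps, is_correct):
    """Fraction of trailing steps that can be dropped while staying correct.

    Finds the shortest prefix of `steps` for which `is_correct` still holds
    and returns the share of steps after it. Returns 0.0 if no prefix
    (including the full trace) is correct or if the trace is empty.
    """
    n = len(steps)
    if n == 0:
        return 0.0
    for k in range(n + 1):
        if is_correct(steps[:k]):
            return (n - k) / n
    return 0.0

# Toy trace: the answer appears at step 3; the rest is re-checking.
trace = ["set up", "compute", "answer=4", "double-check", "re-verify"]
has_answer = lambda prefix: "answer=4" in prefix

print(redundancy(trace, has_answer))
```

In this toy trace the last two of five steps can be removed without losing the answer, so the redundancy is two fifths.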


Any-Dimensional Invariant Universality

arXiv cs.LG · 2026-05-25

This paper develops a systematic framework for establishing universality of machine learning models that handle inputs of varying dimensions (e.g., graphs with different node counts). It shows that many existing architectures fail to be universal and proposes simple modifications to restore universality.


Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

arXiv cs.LG · 2026-05-21

This paper investigates the 'small-vs-large gap', where training on fewer samples with more repetitions can lead to faster learning and compute savings compared to using larger datasets, attributing the speedup to layer-wise growth enabled by sampling biases. The findings suggest that smaller datasets with repetition can be proactively leveraged as favorable inductive biases, particularly in reasoning tasks.


Lossless Anti-Distillation Sampling

arXiv cs.LG · 2026-05-20

This paper proposes Lossless Anti-Distillation Sampling (LADS), a novel sampling scheme that counters multi-account distillation by correlating responses across accounts while preserving exact statistical fidelity for individual benign users. Theoretical analysis and experiments show LADS degrades distilled student performance on image, math, and code generation.


Mixing Times of Glauber Dynamics on Masked Language Models

arXiv cs.LG · 2026-05-19

This paper analyzes the global distributional behavior induced by iterative masked-token resampling in masked language models using Glauber dynamics. It introduces a rectangle test for incompatibility, establishes mixing time bounds, and empirically demonstrates phase transitions and metastable semantic basins.
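
The iterative masked-token resampling described in this summary follows the general Glauber dynamics pattern: pick one position at random and redraw its token from the conditional distribution given all the others. The sketch below shows that update loop under stated assumptions: `conditional` stands in for a masked language model's predicted distribution at a position, and `neighbor_agreement` is a made-up toy conditional, not anything from the paper.

```python
import random

def glauber_step(tokens, conditional, rng):
    """One Glauber update: resample a uniformly chosen position in place."""
    i = rng.randrange(len(tokens))
    probs = conditional(tokens, i)  # distribution over the vocabulary at i
    tokens[i] = rng.choices(range(len(probs)), weights=probs)[0]

def run_glauber(tokens, conditional, steps, seed=0):
    """Apply `steps` Glauber updates to a copy of `tokens`."""
    rng = random.Random(seed)
    state = list(tokens)
    for _ in range(steps):
        glauber_step(state, conditional, rng)
    return state

def neighbor_agreement(tokens, i):
    """Toy binary conditional (assumption): favor the neighbors' tokens."""
    neighbors = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
    ones = sum(neighbors)
    zeros = len(neighbors) - ones
    w0, w1 = 1.0 + 3.0 * zeros, 1.0 + 3.0 * ones
    return [w0 / (w0 + w1), w1 / (w0 + w1)]

final = run_glauber([0, 1, 0, 1, 0, 1], neighbor_agreement, steps=200)
print(final)
```

A real masked language model would replace `neighbor_agreement` with a forward pass that masks position `i` and returns the softmax over the vocabulary; the mixing time asks how many such steps are needed before the state's distribution stops depending on the starting sequence.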
