expressivity

#expressivity

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

arXiv cs.LG · 6d ago

This theoretical paper analyzes the expressivity of padded transformers, showing that attention type, width, and uniformity have little impact compared to numeric precision and model depth. It establishes equivalences between transformer variants and circuit complexity classes like AC0 and TC0, providing a robust characterization.

#expressivity

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

arXiv cs.LG · 2026-05-22

TBP-mHC introduces a novel parameterization for manifold-constrained hyper connections in residual networks, achieving full expressivity of the Birkhoff polytope with O(n^2) degrees of freedom and improved stability and scalability.

#expressivity

Language Acquisition Device in Large Language Models

arXiv cs.CL · 2026-05-19

This paper proposes pre-pretraining inspired by the Language Acquisition Device (LAD), using a formal language called MP-Struct that encodes natural-language-like structures. It shows that this approach improves token efficiency and imparts human-like resistance to structurally implausible languages, challenging prior hypotheses about effective pre-pretraining languages.

#expressivity

Olmo Hybrid: From Theory to Practice and Back

arXiv cs.CL · 2026-04-20

This paper presents Olmo Hybrid, a 7B-parameter language model that combines attention and Gated DeltaNet recurrent layers, demonstrating both theoretical and empirical advantages over pure transformers. The work shows that hybrid models have greater expressivity, scale more efficiently during pretraining, and outperform comparable transformer baselines.

