Tapered Language Models
Summary
This paper introduces Tapered Language Models (TLMs), an architectural principle that allocates more parameters to earlier layers and fewer to later layers under a fixed total budget. Across multiple architectures, this consistently improves perplexity and downstream performance at no additional parameter or compute cost.
Cached at: 06/23/26, 05:40 AM
Source: https://huggingface.co/papers/2606.23670
Abstract
Tapered language models allocate more parameters to earlier layers and fewer to later layers, improving performance without increasing total parameters or compute costs.
Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
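The abstract does not give the exact form of the cosine schedule, so the following is only a minimal sketch of what tapering MLP width under a fixed budget could look like. The function name `tapered_mlp_widths` and the `end_ratio` parameter (last-layer width divided by first-layer width) are hypothetical and not taken from the paper. The sketch also assumes the model dimension is constant across layers, so that MLP parameters and compute scale linearly with hidden width and a fixed sum of widths stands in for a fixed budget.

```python
import math


def tapered_mlp_widths(num_layers, uniform_width, end_ratio=0.5):
    """Per-layer MLP hidden widths that decay along a half-cosine.

    Assumption (not from the paper): the schedule falls smoothly from the
    first layer to the last, and the last layer is `end_ratio` times as
    wide as the first. The widths sum to num_layers * uniform_width, the
    total hidden width of the uniform baseline.
    """
    if num_layers < 2:
        return [uniform_width] * num_layers

    budget = num_layers * uniform_width

    # Half-cosine shape: 1.0 at the first layer, end_ratio at the last.
    shape = []
    for layer in range(num_layers):
        cos_term = 0.5 * (1.0 + math.cos(math.pi * layer / (num_layers - 1)))
        shape.append(end_ratio + (1.0 - end_ratio) * cos_term)

    # Rescale so the total matches the uniform budget, then round down.
    scale = budget / sum(shape)
    widths = [int(s * scale) for s in shape]

    # Give the units lost to rounding back to the earliest layers, which
    # keeps the sequence non-increasing and the total exact.
    leftover = budget - sum(widths)
    for i in range(leftover):
        widths[i % num_layers] += 1
    return widths


if __name__ == "__main__":
    # Hypothetical configuration: 12 layers, baseline MLP width 4096.
    for layer, width in enumerate(tapered_mlp_widths(12, 4096)):
        print(layer, width)
```

Reversing the returned list gives the opposite allocation (narrow early layers, wide late layers), which the abstract reports as hurting perplexity relative to the uniform baseline. In practice widths are often rounded to a hardware-friendly multiple, which this sketch omits.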