The Wiola Architecture for Efficient Small Language Models

arXiv cs.AI Papers

Summary

Wiola is a novel Small Language Model architecture introducing five independently designed components—SRPE, GCLA, ATM, DSFF, and WiolaRMSNorm—aimed at improving efficiency and coherence, released in sizes from 120M to 1.5B parameters and integrated with HuggingFace Transformers.

arXiv:2607.01394v1 Announce Type: new Abstract: We present Wiola, a fully original Small Language Model (SLM) architecture built from first principles, sharing no structural lineage with any existing model family including GPT, LLaMA, Mistral, or Falcon. Wiola introduces five independently novel components: (i) Spiral Rotary Positional Encoding (SRPE), which embeds token positions on a three-dimensional helical manifold combining absolute, relative, and hierarchical positional signals; (ii) Gated Cross-Layer Attention (GCLA), providing each decoder layer with soft cross-attention access to compressed summaries of two preceding layers for inter-layer coherence; (iii) Adaptive Token Merging (ATM), which dynamically merges se mantically redundant adjacent tokens in middle network layers to reduce attention complexity without information loss; (iv) Dual Stream Feed-Forward (DSFF), replacing the conventional MLP with two parallel streams fused by a learned per-dimension gate; and (v) WiolaRMSNorm, a modified normalisation introducing a per-dimension learned offset vector that prevents representation collapse. We provide complete mathematical derivations, architectural block diagrams, complexity analyses, and systematic comparisons against GPT-2, LLaMA-2, and Mistral. Wiola is released in four sizes (120M, 360M, 700M, and 1.5B parameters) and is fully compatible with the HuggingFace Transformers ecosystem, with all 22 architectural unit tests passing.
Original Article
View Cached Full Text

Cached at: 07/03/26, 05:44 AM

# The Wiola Architecture for Efficient Small Language Models ††thanks: This work was conducted as an independent research contribution. No external funding was received.
Source: [https://arxiv.org/html/2607.01394](https://arxiv.org/html/2607.01394)
###### Abstract

We present Wiola, a fully original Small Language Model \(SLM\) architecture built from first principles, sharing no structural lineage with any existing model family including GPT, LLaMA, Mistral, or Falcon\. Wiola introduces five independently novel components: \(i\) Spiral Rotary Positional Encoding \(SRPE\), which embeds token positions on a three\-dimensional helical manifold combining absolute, relative, and hierarchical positional signals; \(ii\) Gated Cross\-Layer Attention \(GCLA\), providing each decoder layer with soft cross\-attention access to compressed summaries of two preceding layers for inter\-layer coherence; \(iii\) Adaptive Token Merging \(ATM\), which dynamically merges semantically redundant adjacent tokens in middle network layers to reduce attention complexity without information loss; \(iv\) Dual\-Stream Feed\-Forward \(DSFF\), replacing the conventional MLP with two parallel streams fused by a learned per\-dimension gate; and \(v\) WiolaRMSNorm, a modified normalisation introducing a per\-dimension learned offset vector that prevents representation collapse\. We provide complete mathematical derivations, architectural block diagrams, complexity analyses, and systematic comparisons against GPT\-2, LLaMA\-2, and Mistral\. Wiola is released in four sizes \(120M, 360M, 700M, and 1\.5B parameters\) and is fully compatible with the HuggingFace Transformers ecosystem, with all 22 architectural unit tests passing\.

## IIntroduction

The Transformer\[[1](https://arxiv.org/html/2607.01394#bib.bib1)\]has driven remarkable progress in natural language processing\. Yet the dominant model families—GPT\[[2](https://arxiv.org/html/2607.01394#bib.bib2)\], LLaMA\[[4](https://arxiv.org/html/2607.01394#bib.bib4)\], Mistral\[[5](https://arxiv.org/html/2607.01394#bib.bib5)\], and their derivatives—share the same structural lineage with incremental differences in positional encoding or attention grouping\. This conservatism leaves open fundamental architectural questions: Can a different positional geometry better capture multi\-scale linguistic structure? Can inter\-layer information routing improve long\-range coherence in generated text? Can token\-level redundancy be exploited to reduce quadratic attention cost?

Wiolais a clean\-slate SLM that addresses all three questions through five novel architectural components\. Every sub\-component is derived from independent mathematical principles and verified to be structurally distinct from all prior published formulations\.

The primary contributions of this work are:

1. 1\.SRPE: A 3D helical positional encoding combining absolute, relative, and hierarchical position on a unified manifold with no extra parameters\.
2. 2\.GCLA: Gated cross\-layer attention providing inter\-layer coherence via compressed layer summaries at negligible compute overhead\.
3. 3\.ATM: Dynamic greedy token merging in middle layers reducing attention FLOPs by 5–9% during training with exact length restoration\.
4. 4\.DSFF: A dual\-stream parallel FFN with per\-dimension learned fusion, separating local and global feature extraction\.
5. 5\.WiolaRMSNorm: Modified RMS normalisation with per\-dimension offset that counteracts representation collapse in deep stacks\.
6. 6\.Aproduction implementationwith 22 passing unit tests and full HuggingFace Hub integration\.

## IIRelated Work

### II\-APositional Encoding

Absolute sinusoidal encodings\[[1](https://arxiv.org/html/2607.01394#bib.bib1)\]and learnable absolute encodings\[[3](https://arxiv.org/html/2607.01394#bib.bib3)\]cannot generalise beyond training length\. Relative encodings such as ALiBi\[[7](https://arxiv.org/html/2607.01394#bib.bib7)\]and T5\-bias\[[8](https://arxiv.org/html/2607.01394#bib.bib8)\]encode pairwise offsets in attention logits\. RoPE\[[6](https://arxiv.org/html/2607.01394#bib.bib6)\]encodes position as a complex\-valued rotation ensuring attention depends only on relative offsetp−qp\-q\. Extensions \(YaRN\[[9](https://arxiv.org/html/2607.01394#bib.bib9)\]\) reparameterise the same flat 2D circle\. Wiola’s SRPE is the first encoding to place positions on a 3D helix with dual winding angles and a sinusoidal radial component, encoding multi\-scale structure analytically without learned parameters\.

### II\-BAttention Variants

Multi\-query attention \(MQA\)\[[11](https://arxiv.org/html/2607.01394#bib.bib11)\]and grouped query attention \(GQA\)\[[10](https://arxiv.org/html/2607.01394#bib.bib10)\]reduce KV\-cache memory\. Sliding window attention\[[5](https://arxiv.org/html/2607.01394#bib.bib5)\]limits quadratic cost to a local window\. Cross\-attention between layers exists in encoder\-decoder models but not in decoder\-only autoregressive LMs\. GCLA is the first formulation injecting cross\-attention from compressed*prior\-layer summaries*into each decoder layer\.

### II\-CFeed\-Forward Networks

SwiGLU\[[12](https://arxiv.org/html/2607.01394#bib.bib12)\]and GELU\[[13](https://arxiv.org/html/2607.01394#bib.bib13)\]variants of the single\-stream MLP are ubiquitous\. Mixture\-of\-Experts \(MoE\)\[[15](https://arxiv.org/html/2607.01394#bib.bib15)\]routes tokens sparsely to expert FFNs\. DSFF is distinct: two parallel*dense*streams of different widths and activations fused by a learned per\-dimension gate—not sparse routing, not a single stream\.

### II\-DToken Compression

Token merging for vision transformers \(ToMe\[[16](https://arxiv.org/html/2607.01394#bib.bib16)\]\) uses bipartite matching\. ATM applies adjacent\-token cosine\-similarity merging to language model hidden states in the middle third of a causal decoder—a transfer not previously explored\.

## IIINotation

Scalars: italic \(x,d,Tx,d,T\)\. Vectors: bold lower\-case \(𝒙\\bm\{x\}\)\. Matrices: bold upper\-case \(𝐖\\mathbf\{W\}\)\. Concatenation:\[𝒂;𝒃\]\[\\bm\{a\};\\bm\{b\}\]\. Element\-wise product:⊙\\odot\. Sigmoid:σ​\(x\)=\(1\+e−x\)−1\\sigma\(x\)=\(1\+e^\{\-x\}\)^\{\-1\}\.\[n\]≜\{0,…,n−1\}\[n\]\\triangleq\\\{0,\\ldots,n\-1\\\}\.

Table[I](https://arxiv.org/html/2607.01394#S3.T1)lists the core hyperparameter symbols and their default values for the wiola\-360m configuration\.

TABLE I:Core Hyperparameter Symbols \(wiola\-360m defaults\)
## IVWiola Architecture

### IV\-AMacro Structure

Wiola is an autoregressive decoder\-only LM\. Token IDs are embedded into𝐗\(0\)∈ℝT×d\\mathbf\{X\}^\{\(0\)\}\\in\\mathbb\{R\}^\{T\\times d\}, passed throughLLdecoder layers, normalised, and projected to logits by a tied linear head\. For layerℓ∈\[L\]\\ell\\in\[L\]:

𝐗~\(ℓ\)\\displaystyle\\tilde\{\\mathbf\{X\}\}^\{\(\\ell\)\}=WRMSNormℓ⁡\(𝐗\(ℓ\)\),\\displaystyle=\\operatorname\{WRMSNorm\}\_\{\\ell\}\\\!\\left\(\\mathbf\{X\}^\{\(\\ell\)\}\\right\),\(1\)𝐀\(ℓ\)\\displaystyle\\mathbf\{A\}^\{\(\\ell\)\}=GCLAℓ⁡\(𝐗~\(ℓ\),𝒞\(ℓ\)\),\\displaystyle=\\operatorname\{GCLA\}\_\{\\ell\}\\\!\\left\(\\tilde\{\\mathbf\{X\}\}^\{\(\\ell\)\},\\mathcal\{C\}^\{\(\\ell\)\}\\right\),\(2\)𝐗\(ℓ\+12\)\\displaystyle\\mathbf\{X\}^\{\(\\ell\+\\frac\{1\}\{2\}\)\}=𝐗\(ℓ\)\+𝐀\(ℓ\),\\displaystyle=\\mathbf\{X\}^\{\(\\ell\)\}\+\\mathbf\{A\}^\{\(\\ell\)\},\(3\)𝐗^\(ℓ\)\\displaystyle\\hat\{\\mathbf\{X\}\}^\{\(\\ell\)\}=WRMSNormℓ′⁡\(𝐗\(ℓ\+12\)\),\\displaystyle=\\operatorname\{WRMSNorm\}\_\{\\ell\}^\{\\prime\}\\\!\\left\(\\mathbf\{X\}^\{\(\\ell\+\\frac\{1\}\{2\}\)\}\\right\),\(4\)𝐅\(ℓ\)\\displaystyle\\mathbf\{F\}^\{\(\\ell\)\}=DSFFℓ⁡\(𝐗^\(ℓ\)\),\\displaystyle=\\operatorname\{DSFF\}\_\{\\ell\}\\\!\\left\(\\hat\{\\mathbf\{X\}\}^\{\(\\ell\)\}\\right\),\(5\)𝐗\(ℓ\+1\)\\displaystyle\\mathbf\{X\}^\{\(\\ell\+1\)\}=𝐗\(ℓ\+12\)\+𝐅\(ℓ\)\.\\displaystyle=\\mathbf\{X\}^\{\(\\ell\+\\frac\{1\}\{2\}\)\}\+\\mathbf\{F\}^\{\(\\ell\)\}\.\(6\)
ATM is inserted between \([1](https://arxiv.org/html/2607.01394#S4.E1)\) and \([2](https://arxiv.org/html/2607.01394#S4.E2)\) for middle\-third layers during training\. The output logits are:

𝐙=WRMSNormfinal⁡\(𝐗\(L\)\)​𝐖head,𝐖head=𝐄⊤∈ℝd×V\.\\mathbf\{Z\}=\\operatorname\{WRMSNorm\}\_\{\\mathrm\{final\}\}\\\!\\left\(\\mathbf\{X\}^\{\(L\)\}\\right\)\\mathbf\{W\}\_\{\\mathrm\{head\}\},\\quad\\mathbf\{W\}\_\{\\mathrm\{head\}\}=\\mathbf\{E\}^\{\\top\}\\in\\mathbb\{R\}^\{d\\times V\}\.\(7\)

### IV\-BLayer Block Diagram

Fig\.[1](https://arxiv.org/html/2607.01394#S4.F1)illustrates the complete Wiola decoder layer\.

![Refer to caption](https://arxiv.org/html/2607.01394v1/decoder.png)Figure 1:Wiola decoder layer\. Orange dashed arrow: cross\-layer summary𝒞\(ℓ\)\\mathcal\{C\}^\{\(\\ell\)\}from prior layers injected into GCLA\. ATM active during training in the middle third of layers only\.

## VWiolaRMSNorm

Standard RMSNorm\[[17](https://arxiv.org/html/2607.01394#bib.bib17)\]normalises:

RMSNorm​\(𝒙\)=𝜸⊙𝒙RMS⁡\(𝒙\),RMS⁡\(𝒙\)=1d​∑i=1dxi2\+ϵ\.\\mathrm\{RMSNorm\}\(\\bm\{x\}\)=\\bm\{\\gamma\}\\odot\\frac\{\\bm\{x\}\}\{\\operatorname\{RMS\}\(\\bm\{x\}\)\},\\quad\\operatorname\{RMS\}\(\\bm\{x\}\)=\\\!\\sqrt\{\\tfrac\{1\}\{d\}\\textstyle\\sum\_\{i=1\}^\{d\}x\_\{i\}^\{2\}\+\\epsilon\}\.\(8\)It cannot shift the effective zero\-point of a layer’s distribution\. Dong et al\.\[[19](https://arxiv.org/html/2607.01394#bib.bib19)\]showed that deep attention networks suffer*representation collapse*where hidden states converge to a degenerate low\-rank subspace\. Rescaling alone cannot counteract this\.

WiolaRMSNormintroduces a learned per\-dimension offset𝜹∈ℝd\\bm\{\\delta\}\\in\\mathbb\{R\}^\{d\}that shifts the*input before normalisation*:

WRMSNorm\(𝒙\)=𝜸⊙𝒙\+𝜹1d​∑i=1d\(xi\+δi\)2\+ϵ\.\\boxed\{\\operatorname\{WRMSNorm\}\(\\bm\{x\}\)=\\bm\{\\gamma\}\\odot\\frac\{\\bm\{x\}\+\\bm\{\\delta\}\}\{\\sqrt\{\\tfrac\{1\}\{d\}\\sum\_\{i=1\}^\{d\}\(x\_\{i\}\+\\delta\_\{i\}\)^\{2\}\+\\epsilon\}\}\.\}\(9\)Setting𝒛=𝒙\+𝜹\\bm\{z\}=\\bm\{x\}\+\\bm\{\\delta\}yieldsWRMSNorm⁡\(𝒙\)=𝜸⊙𝒛/RMS⁡\(𝒛\)\\operatorname\{WRMSNorm\}\(\\bm\{x\}\)=\\bm\{\\gamma\}\\odot\\bm\{z\}/\\operatorname\{RMS\}\(\\bm\{z\}\)\. Setting𝜹=𝟎\\bm\{\\delta\}=\\bm\{0\}recovers \([8](https://arxiv.org/html/2607.01394#S5.E8)\) exactly, so WiolaRMSNorm strictly generalises RMSNorm\.

The gradient with respect toδi\\delta\_\{i\}is:

∂ℒ∂δi=γir​\(∂ℒ∂x^i−zid​r2​∑kγk​∂ℒ∂x^k​zk\),r=RMS⁡\(𝒛\),\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\delta\_\{i\}\}=\\frac\{\\gamma\_\{i\}\}\{r\}\\\!\\left\(\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\hat\{x\}\_\{i\}\}\-\\frac\{z\_\{i\}\}\{dr^\{2\}\}\\sum\_\{k\}\\gamma\_\{k\}\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\hat\{x\}\_\{k\}\}z\_\{k\}\\right\),\\quad r=\\operatorname\{RMS\}\(\\bm\{z\}\),\(10\)which is non\-zero in general, ensuring𝜹\\bm\{\\delta\}diverges from𝟎\\bm\{0\}during training\.

The per\-layer overhead isddadditional parameters \(𝜹\\bm\{\\delta\}\) over RMSNorm\. With2​L2Lnormalisations per model, total overhead is2​L​d=32,7682Ld=32\{,\}768parameters for wiola\-360m \(0\.009%0\.009\\%of total\)\.

Fig\.[2](https://arxiv.org/html/2607.01394#S5.F2)shows the data flow through WiolaRMSNorm\.

![Refer to caption](https://arxiv.org/html/2607.01394v1/RMS.png)Figure 2:WiolaRMSNorm data flow\. The offset𝜹\\bm\{\\delta\}shifts the inputbeforeRMS computation, changing the normalisation target itself rather than adding a post\-normalisation bias\.
## VISpiral Rotary Positional Encoding \(SRPE\)

### VI\-AMotivation

RoPE\[[6](https://arxiv.org/html/2607.01394#bib.bib6)\]maps positionppto a 2D rotation per dimension pair, encoding relative offset exactly but representing only one positional scale\. Natural language has at least three scales: sub\-word tokens, phrase\-level constituents \(3–15 tokens\), and discourse units \(sentences, paragraphs\)\. SRPE embeds positions on a*3D helical manifold*, encoding all three scales in a single analytic formula with no additional learned parameters\.

### VI\-BMathematical Derivation

For positionp∈\[T\]p\\in\[T\]and dimension\-pair indexj∈\[dh/2\]j\\in\[d\_\{h\}/2\]:

Step 1 — Primary inverse frequency:

ωj=θ0−2​j/dh\.\\omega\_\{j\}=\\theta\_\{0\}^\{\-2j/d\_\{h\}\}\.\(11\)
Step 2 — Dual winding angles:

θj\(1\)​\(p\)\\displaystyle\\theta\_\{j\}^\{\(1\)\}\(p\)=p​ωj,θj\(2\)​\(p\)=p​ωjks,\\displaystyle=p\\omega\_\{j\},\\quad\\theta\_\{j\}^\{\(2\)\}\(p\)=\\frac\{p\\omega\_\{j\}\}\{k\_\{s\}\},\(12\)Θj​\(p\)\\displaystyle\\Theta\_\{j\}\(p\)=p​ωj​\(1\+1ks\)\.\\displaystyle=p\\omega\_\{j\}\\\!\\left\(1\+\\tfrac\{1\}\{k\_\{s\}\}\\right\)\.\(13\)
Step 3 — Radial modulation:

rj​\(p\)=1\+as​sin⁡\(p​fs​ωj\)\.r\_\{j\}\(p\)=1\+a\_\{s\}\\sin\\\!\\left\(pf\_\{s\}\\omega\_\{j\}\\right\)\.\(14\)
Step 4 — Encoding coefficients:

cj​\(p\)\\displaystyle c\_\{j\}\(p\)=rj​\(p\)​cos⁡Θj​\(p\),\\displaystyle=r\_\{j\}\(p\)\\cos\\Theta\_\{j\}\(p\),\(15\)sj​\(p\)\\displaystyle s\_\{j\}\(p\)=rj​\(p\)​sin⁡Θj​\(p\)\.\\displaystyle=r\_\{j\}\(p\)\\sin\\Theta\_\{j\}\(p\)\.\(16\)
Step 5 — Application to queryq∈ℝdh\\bm\{q\}\\in\\mathbb\{R\}^\{d\_\{h\}\}:

SRPE\(𝒒,p\)j\\displaystyle\\operatorname\{SRPE\}\(\\bm\{q\},p\)\_\{j\}=qj​cj​\(p\)−qj\+dh/2​sj​\(p\),\\displaystyle=q\_\{j\}c\_\{j\}\(p\)\-q\_\{j\+d\_\{h\}/2\}s\_\{j\}\(p\),\(17\)SRPE\(𝒒,p\)j\+dh/2\\displaystyle\\operatorname\{SRPE\}\(\\bm\{q\},p\)\_\{j\+d\_\{h\}/2\}=qj​sj​\(p\)\+qj\+dh/2​cj​\(p\)\.\\displaystyle=q\_\{j\}s\_\{j\}\(p\)\+q\_\{j\+d\_\{h\}/2\}c\_\{j\}\(p\)\.\(18\)The same rotation is applied to keys𝒌\\bm\{k\}\.

In matrix form:SRPE⁡\(𝒒,p\)=𝐑​\(p\)​𝒒\\operatorname\{SRPE\}\(\\bm\{q\},p\)=\\mathbf\{R\}\(p\)\\bm\{q\}where𝐑​\(p\)=⨁j\[cj−sjsjcj\]\\mathbf\{R\}\(p\)=\\bigoplus\_\{j\}\\bigl\[\\begin\{smallmatrix\}c\_\{j\}&\-s\_\{j\}\\\\ s\_\{j\}&c\_\{j\}\\end\{smallmatrix\}\\bigr\]\.

Relative position property:The dot\-product contribution from pairjjis:

rj​\(p\)​rj​\(q\)​cos⁡\(Θj​\(p\)−Θj​\(q\)\),r\_\{j\}\(p\)\\,r\_\{j\}\(q\)\\cos\\\!\\bigl\(\\Theta\_\{j\}\(p\)\-\\Theta\_\{j\}\(q\)\\bigr\),\(19\)whereΘj​\(p\)−Θj​\(q\)=\(p−q\)​ωj​\(1\+1/ks\)\\Theta\_\{j\}\(p\)\-\\Theta\_\{j\}\(q\)=\(p\-q\)\\omega\_\{j\}\(1\+1/k\_\{s\}\)depends only on the relative offsetΔ=p−q\\Delta=p\-q\. The radial productrj​\(p\)​rj​\(q\)r\_\{j\}\(p\)r\_\{j\}\(q\)introduces controlled absolute\-position dependence encoding discourse structure\.

Table[II](https://arxiv.org/html/2607.01394#S6.T2)compares SRPE with RoPE\.

TABLE II:SRPE vs\. RoPE

## VIIGated Cross\-Layer Attention \(GCLA\)

### VII\-ACross\-Layer Summary Cache

After layerℓ\\ellproduces𝐗\(ℓ\+1\)∈ℝT×d\\mathbf\{X\}^\{\(\\ell\+1\)\}\\in\\mathbb\{R\}^\{T\\times d\}, a summary is formed by mean\-pooling:

𝒔\(ℓ\)=1T​∑t=1T𝐗t,:\(ℓ\+1\)∈ℝd\.\\bm\{s\}^\{\(\\ell\)\}=\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\\mathbf\{X\}^\{\(\\ell\+1\)\}\_\{t,:\}\\in\\mathbb\{R\}^\{d\}\.\(20\)The context matrix for the next layer uses the most recentΛ=2\\Lambda=2summaries:

𝒞\(ℓ\+1\)=\[𝒔\(ℓ−1\);𝒔\(ℓ\)\]∈ℝΛ×d\.\\mathcal\{C\}^\{\(\\ell\+1\)\}=\\bigl\[\\bm\{s\}^\{\(\\ell\-1\)\};\\bm\{s\}^\{\(\\ell\)\}\\bigr\]\\in\\mathbb\{R\}^\{\\Lambda\\times d\}\.\(21\)

### VII\-BSelf\-Attention with SRPE and GQA

Projections:𝐐=𝐗~​𝐖Q\\mathbf\{Q\}=\\tilde\{\\mathbf\{X\}\}\\mathbf\{W\}\_\{Q\},𝐊=𝐗~​𝐖K\\mathbf\{K\}=\\tilde\{\\mathbf\{X\}\}\\mathbf\{W\}\_\{K\},𝐕=𝐗~​𝐖V\\mathbf\{V\}=\\tilde\{\\mathbf\{X\}\}\\mathbf\{W\}\_\{V\}, with𝐖Q∈ℝd×H​dh\\mathbf\{W\}\_\{Q\}\\in\\mathbb\{R\}^\{d\\times Hd\_\{h\}\}and𝐖K,𝐖V∈ℝd×Hkv​dh\\mathbf\{W\}\_\{K\},\\mathbf\{W\}\_\{V\}\\in\\mathbb\{R\}^\{d\\times H\_\{\\mathrm\{kv\}\}d\_\{h\}\}\.

SRPE applied per\-head:𝐐~h=SRPE⁡\(𝐐h\)\\tilde\{\\mathbf\{Q\}\}\_\{h\}=\\operatorname\{SRPE\}\(\\mathbf\{Q\}\_\{h\}\),𝐊~h=SRPE⁡\(𝐊h\)\\tilde\{\\mathbf\{K\}\}\_\{h\}=\\operatorname\{SRPE\}\(\\mathbf\{K\}\_\{h\}\)\.

Causal self\-attention for headhh, GQA groupg=hmodHkvg=h\\bmod H\_\{\\mathrm\{kv\}\}:

𝐀h\\displaystyle\\mathbf\{A\}\_\{h\}=softmax⁡\(𝐐~h​𝐊~g⊤\+𝐌dh\),\\displaystyle=\\operatorname\{softmax\}\\\!\\left\(\\frac\{\\tilde\{\\mathbf\{Q\}\}\_\{h\}\\tilde\{\\mathbf\{K\}\}\_\{g\}^\{\\top\}\+\\mathbf\{M\}\}\{\\sqrt\{d\_\{h\}\}\}\\right\),\(22\)𝐎hself\\displaystyle\\mathbf\{O\}\_\{h\}^\{\\mathrm\{self\}\}=𝐀h​𝐕g,\\displaystyle=\\mathbf\{A\}\_\{h\}\\mathbf\{V\}\_\{g\},\(23\)where𝐌\\mathbf\{M\}is the causal mask \(−∞\-\\inftyabove diagonal\)\.

### VII\-CCross\-Layer Context Sub\-Attention

𝐊ctx\\displaystyle\\mathbf\{K\}^\{\\mathrm\{ctx\}\}=𝒞\(ℓ\)​𝐖Kctx∈ℝΛ×Hkv​dh,\\displaystyle=\\mathcal\{C\}^\{\(\\ell\)\}\\mathbf\{W\}\_\{K\}^\{\\mathrm\{ctx\}\}\\in\\mathbb\{R\}^\{\\Lambda\\times H\_\{\\mathrm\{kv\}\}d\_\{h\}\},\(24\)𝐕ctx\\displaystyle\\mathbf\{V\}^\{\\mathrm\{ctx\}\}=𝒞\(ℓ\)​𝐖Vctx∈ℝΛ×Hkv​dh,\\displaystyle=\\mathcal\{C\}^\{\(\\ell\)\}\\mathbf\{W\}\_\{V\}^\{\\mathrm\{ctx\}\}\\in\\mathbb\{R\}^\{\\Lambda\\times H\_\{\\mathrm\{kv\}\}d\_\{h\}\},\(25\)𝐎hctx\\displaystyle\\mathbf\{O\}\_\{h\}^\{\\mathrm\{ctx\}\}=softmax⁡\(𝐐~h​\(𝐊gctx\)⊤dh\)​𝐕gctx\.\\displaystyle=\\operatorname\{softmax\}\\\!\\left\(\\frac\{\\tilde\{\\mathbf\{Q\}\}\_\{h\}\(\\mathbf\{K\}\_\{g\}^\{\\mathrm\{ctx\}\}\)^\{\\top\}\}\{\\sqrt\{d\_\{h\}\}\}\\right\)\\mathbf\{V\}\_\{g\}^\{\\mathrm\{ctx\}\}\.\(26\)

### VII\-DContext Blending and Output Gate

Scalar gateβ=σ​\(ϕ\)\\beta=\\sigma\(\\phi\),ϕ\\phiinitialised at−3\-3\(soβ0≈0\.047\\beta\_\{0\}\\approx 0\.047\):

𝐎h=\(1−β\)​𝐎hself\+β​𝐎hctx\.\\mathbf\{O\}\_\{h\}=\(1\-\\beta\)\\mathbf\{O\}\_\{h\}^\{\\mathrm\{self\}\}\+\\beta\\mathbf\{O\}\_\{h\}^\{\\mathrm\{ctx\}\}\.\(27\)Sigmoid output gate on merged heads𝐎=\[𝐎1;…;𝐎H\]\\mathbf\{O\}=\[\\mathbf\{O\}\_\{1\};\\ldots;\\mathbf\{O\}\_\{H\}\]:

𝐆\\displaystyle\\mathbf\{G\}=σ​\(𝐗~​𝐖gate\)∈ℝT×H​dh,\\displaystyle=\\sigma\\\!\\left\(\\tilde\{\\mathbf\{X\}\}\\mathbf\{W\}\_\{\\mathrm\{gate\}\}\\right\)\\in\\mathbb\{R\}^\{T\\times Hd\_\{h\}\},\(28\)𝐀\(ℓ\)\\displaystyle\\mathbf\{A\}^\{\(\\ell\)\}=\(𝐆⊙𝐎\)​𝐖O\.\\displaystyle=\(\\mathbf\{G\}\\odot\\mathbf\{O\}\)\\mathbf\{W\}\_\{O\}\.\(29\)
The context attention adds2​B​T​Λ​H​dh2BT\\Lambda Hd\_\{h\}FLOPs per layer, which isΛ/T=2/2048≈0\.1%\\Lambda/T=2/2048\\approx 0\.1\\%of the self\-attention cost—asymptotically negligible\.

Fig\.[3](https://arxiv.org/html/2607.01394#S7.F3)shows the GCLA data flow\.

![Refer to caption](https://arxiv.org/html/2607.01394v1/GCLA.png)Figure 3:GCLA data flow\. Queries attend to local KV \(self\-attention\) and cross\-layer context𝒞\(ℓ\)\\mathcal\{C\}^\{\(\\ell\)\}\. Scalarβ\\betablends both paths, while gate𝐆\\mathbf\{G\}provides multiplicative output control\.

## VIIIAdaptive Token Merging \(ATM\)

### VIII\-ACosine Similarity Criterion

For hidden states𝐗∈ℝT×d\\mathbf\{X\}\\in\\mathbb\{R\}^\{T\\times d\}, the cosine similarity between adjacent tokensttandt\+1t\+1is:

ρt=𝒙^t⋅𝒙^t\+1,𝒙^t=𝒙t/‖𝒙t‖,t=1,…,T−1\.\\rho\_\{t\}=\\hat\{\\bm\{x\}\}\_\{t\}\\cdot\\hat\{\\bm\{x\}\}\_\{t\+1\},\\quad\\hat\{\\bm\{x\}\}\_\{t\}=\\bm\{x\}\_\{t\}/\\\|\\bm\{x\}\_\{t\}\\\|,\\quad t=1,\\ldots,T\-1\.\(30\)

### VIII\-BGreedy Non\-Overlapping Merge

The merge algorithm \(Algorithm[1](https://arxiv.org/html/2607.01394#alg1)\) scans left\-to\-right, averaging pairs withρt\>τ\\rho\_\{t\}\>\\tau:

𝒙k′=12​\(𝒙t\+𝒙t\+1\)if​ρt\>τ\.\\bm\{x\}^\{\\prime\}\_\{k\}=\\tfrac\{1\}\{2\}\(\\bm\{x\}\_\{t\}\+\\bm\{x\}\_\{t\+1\}\)\\quad\\text\{if \}\\rho\_\{t\}\>\\tau\.\(31\)A merge mapℳ=\{Gk\}k=1T′\\mathcal\{M\}=\\\{G\_\{k\}\\\}\_\{k=1\}^\{T^\{\\prime\}\}records source positionsGk⊆\[T\]G\_\{k\}\\subseteq\[T\],\|Gk\|∈\{1,2\}\|G\_\{k\}\|\\in\\\{1,2\\\}\.

Algorithm 1ATM Greedy Merge0:

𝐗∈ℝT×d\\mathbf\{X\}\\in\\mathbb\{R\}^\{T\\times d\}, threshold

τ\\tau
0:Merged

𝐗′∈ℝT′×d\\mathbf\{X\}^\{\\prime\}\\in\\mathbb\{R\}^\{T^\{\\prime\}\\times d\}, merge map

ℳ\\mathcal\{M\}
1:Compute

ρt=𝒙^t⋅𝒙^t\+1\\rho\_\{t\}=\\hat\{\\bm\{x\}\}\_\{t\}\\cdot\\hat\{\\bm\{x\}\}\_\{t\+1\}for all

tt
2:

𝒳′←\[\]\\mathcal\{X\}^\{\\prime\}\\leftarrow\[\];

ℳ←\[\]\\mathcal\{M\}\\leftarrow\[\];

i←0i\\leftarrow 0
3:while

i<Ti<Tdo

4:if

i<T−1i<T\{\-\}1and

ρi\>τ\\rho\_\{i\}\>\\tauthen

5:Append

\(𝒙i\+𝒙i\+1\)/2\(\\bm\{x\}\_\{i\}\+\\bm\{x\}\_\{i\+1\}\)/2to

𝒳′\\mathcal\{X\}^\{\\prime\}
6:Append

\(i,i\+1\)\(i,i\{\+\}1\)to

ℳ\\mathcal\{M\};

i←i\+2i\\leftarrow i\+2
7:else

8:Append

𝒙i\\bm\{x\}\_\{i\}to

𝒳′\\mathcal\{X\}^\{\\prime\}
9:Append

\(i,\)\(i,\)to

ℳ\\mathcal\{M\};

i←i\+1i\\leftarrow i\+1
10:endif

11:endwhile

12:return

𝐗′←stack​\(𝒳′\)\\mathbf\{X\}^\{\\prime\}\\leftarrow\\mathrm\{stack\}\(\\mathcal\{X\}^\{\\prime\}\),

ℳ\\mathcal\{M\}

### VIII\-CUnmerge Restoration

After attention produces𝐗^′∈ℝT′×d\\hat\{\\mathbf\{X\}\}^\{\\prime\}\\in\\mathbb\{R\}^\{T^\{\\prime\}\\times d\}, the original length is restored:

x^t=x^k′∀t∈Gk,k∈\[T′\]\.\\hat\{x\}\_\{t\}=\\hat\{x\}^\{\\prime\}\_\{k\}\\quad\\forall t\\in G\_\{k\},\\quad k\\in\[T^\{\\prime\}\]\.\(32\)

### VIII\-DComplexity Analysis

With merge ratioμ=1−T′/T\\mu=1\-T^\{\\prime\}/T, the FLOPs saving per active layer is:

Δ​C=1−\(1−μ\)2=μ​\(2−μ\)\.\\Delta C=1\-\(1\-\\mu\)^\{2\}=\\mu\(2\-\\mu\)\.\(33\)Forτ=0\.92\\tau=0\.92, empiricalμ≈0\.08\\mu\\approx 0\.08–0\.140\.14, givingΔ​C≈15\\Delta C\\approx 15–26%26\\%per active layer\. Applied toL/3L/3layers, total training FLOPs reduction is approximately55–9%9\\%\.

ATM is active only during training; disabled at inference to maintain KV\-cache consistency\.

## IXDual\-Stream Feed\-Forward \(DSFF\)

### IX\-AFormulation

DSFF uses two parallel dense streams fused by a per\-dimension learned gate\.

Stream A\(local patterns, SwiGLU, narrow widthdAd\_\{A\}\):

𝒂=𝐃A​\(SiLU⁡\(𝐆A​𝒙\)⊙𝐔A​𝒙\)∈ℝd,\\bm\{a\}=\\mathbf\{D\}\_\{A\}\\\!\\left\(\\operatorname\{SiLU\}\(\\mathbf\{G\}\_\{A\}\\bm\{x\}\)\\odot\\mathbf\{U\}\_\{A\}\\bm\{x\}\\right\)\\in\\mathbb\{R\}^\{d\},\(34\)where𝐔A,𝐆A∈ℝd×dA\\mathbf\{U\}\_\{A\},\\mathbf\{G\}\_\{A\}\\in\\mathbb\{R\}^\{d\\times d\_\{A\}\},𝐃A∈ℝdA×d\\mathbf\{D\}\_\{A\}\\in\\mathbb\{R\}^\{d\_\{A\}\\times d\}\.

Stream B\(global semantics, GELU, wide widthdB≫dAd\_\{B\}\\gg d\_\{A\}\):

𝒃=𝐃B​\(GELU⁡\(𝐔B​𝒙\)\)∈ℝd,\\bm\{b\}=\\mathbf\{D\}\_\{B\}\\\!\\left\(\\operatorname\{GELU\}\(\\mathbf\{U\}\_\{B\}\\bm\{x\}\)\\right\)\\in\\mathbb\{R\}^\{d\},\(35\)where𝐔B∈ℝd×dB\\mathbf\{U\}\_\{B\}\\in\\mathbb\{R\}^\{d\\times d\_\{B\}\},𝐃B∈ℝdB×d\\mathbf\{D\}\_\{B\}\\in\\mathbb\{R\}^\{d\_\{B\}\\times d\}\.

Per\-dimension fusion gate:

𝜶=σ​\(𝐖f​\[𝒂;𝒃\]\)∈\(0,1\)d,𝐖f∈ℝ2​d×d\.\\bm\{\\alpha\}=\\sigma\\\!\\left\(\\mathbf\{W\}\_\{f\}\[\\bm\{a\};\\bm\{b\}\]\\right\)\\in\(0,1\)^\{d\},\\quad\\mathbf\{W\}\_\{f\}\\in\\mathbb\{R\}^\{2d\\times d\}\.\(36\)
Fused output:

DSFF⁡\(𝒙\)=𝜶⊙𝒂\+\(1−𝜶\)⊙𝒃\.\\operatorname\{DSFF\}\(\\bm\{x\}\)=\\bm\{\\alpha\}\\odot\\bm\{a\}\+\(1\-\\bm\{\\alpha\}\)\\odot\\bm\{b\}\.\(37\)
Setting𝐖f=𝟎\\mathbf\{W\}\_\{f\}=\\mathbf\{0\}gives𝜶=0\.5​𝟏\\bm\{\\alpha\}=0\.5\\bm\{1\}, reducing \([37](https://arxiv.org/html/2607.01394#S9.E37)\) to a simple ensemble average\. DSFF strictly generalises stream ensemble\.

The SiLU activation\[[14](https://arxiv.org/html/2607.01394#bib.bib14)\]used in Stream A:SiLU⁡\(x\)=x​σ​\(x\)\\operatorname\{SiLU\}\(x\)=x\\sigma\(x\)provides sharp, non\-monotonic gating suited to local discriminative patterns\. GELU\[[13](https://arxiv.org/html/2607.01394#bib.bib13)\]in Stream B provides smooth activation suited to superposing many weakly\-active semantic features\.

Fig\.[4](https://arxiv.org/html/2607.01394#S9.F4)shows the DSFF data flow\.

![Refer to caption](https://arxiv.org/html/2607.01394v1/DSFF.png)Figure 4:DSFF data flow\. Stream A \(purple\): narrow SwiGLU for local patterns\. Stream B \(teal\): wide GELU for global semantics\. Gate𝜶∈\(0,1\)d\\bm\{\\alpha\}\\in\(0,1\)^\{d\}is per\-dimension and input\-dependent, computed from concatenated stream outputs\.

## XModel Variants and Parameter Budgets

Table[III](https://arxiv.org/html/2607.01394#S10.T3)summarises the four Wiola size variants\. Table[IV](https://arxiv.org/html/2607.01394#S10.T4)gives the full parameter budget for wiola\-360m\.

TABLE III:Wiola Model FamilyTABLE IV:Parameter Budget: wiola\-360m \(d=1024d\\\!=\\\!1024,L=16L\\\!=\\\!16,V=32000V\\\!=\\\!32000\)
## XIComputational Complexity

The KV\-cache memory for inference at sequence positionttis:

MKV=2​L​Hkv​dh​t⋅bdtype,M\_\{\\mathrm\{KV\}\}=2LH\_\{\\mathrm\{kv\}\}d\_\{h\}t\\cdot b\_\{\\mathrm\{dtype\}\},\(38\)wherebdtype=2b\_\{\\mathrm\{dtype\}\}=2bytes \(BF16\)\. For wiola\-360m att=2048t=2048:MKV=2×16×4×64×2048×2=67\.1M\_\{\\mathrm\{KV\}\}=2\\times 16\\times 4\\times 64\\times 2048\\times 2=67\.1MB\.

Per\-layer attention FLOPs for MHA, GQA, and GCLA:

CMHA\\displaystyle C\_\{\\mathrm\{MHA\}\}=4​B​T2​H​dh,\\displaystyle=4BT^\{2\}Hd\_\{h\},\(39\)CGQA\\displaystyle C\_\{\\mathrm\{GQA\}\}=2​B​T2​\(H\+Hkv\)​dh,\\displaystyle=2BT^\{2\}\(H\+H\_\{\\mathrm\{kv\}\}\)d\_\{h\},\(40\)CGCLA\\displaystyle C\_\{\\mathrm\{GCLA\}\}=CGQA\+2​B​T​Λ​H​dh\.\\displaystyle=C\_\{\\mathrm\{GQA\}\}\+2BT\\Lambda Hd\_\{h\}\.\(41\)The GCLA overhead2​B​T​Λ​H​dh2BT\\Lambda Hd\_\{h\}over GQA equalsΛ/T≈0\.1%\\Lambda/T\\approx 0\.1\\%of self\-attention cost atT=2048T=2048,Λ=2\\Lambda=2\.

## XIISystematic Architectural Comparison

Table[V](https://arxiv.org/html/2607.01394#S12.T5)classifies each Wiola component as Novel \(N\) or Shared \(S\) relative to five architectures, where novel means mathematically distinct formulation—not merely a change in hyperparameter values\.

TABLE V:Component Novelty Matrix \(N=Novel, S=Shared\)Table[VI](https://arxiv.org/html/2607.01394#S12.T6)provides a detailed architectural comparison\.

TABLE VI:Detailed Architectural ComparisonTable[VII](https://arxiv.org/html/2607.01394#S12.T7)compares KV\-cache footprints\.

TABLE VII:KV\-Cache Footprint atT=2048T=2048, BF16
## XIIITraining Methodology

### XIII\-AObjective

Next\-token prediction loss:

ℒ=−1T−1​∑t=1T−1log⁡Pθ​\(xt\+1∣x≤t\)\.\\mathcal\{L\}=\-\\frac\{1\}\{T\-1\}\\sum\_\{t=1\}^\{T\-1\}\\log P\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.\(42\)

### XIII\-BOptimiser

AdamW\[[21](https://arxiv.org/html/2607.01394#bib.bib21)\]withβ1=0\.9\\beta\_\{1\}=0\.9,β2=0\.95\\beta\_\{2\}=0\.95,ϵ=10−8\\epsilon=10^\{\-8\}, weight decayλ=0\.1\\lambda=0\.1, gradient clipping‖∇ℒ‖2≤1\.0\\\|\\nabla\\mathcal\{L\}\\\|\_\{2\}\\leq 1\.0\.

### XIII\-CLearning Rate Schedule

Linear warmup then cosine decay overTmaxT\_\{\\max\}steps with warmupTw=0\.05​TmaxT\_\{w\}=0\.05T\_\{\\max\}:

η​\(t\)=\{ηmax​t/Twt<Tw,ηmax2​\(1\+cos⁡\(π​t−TwTmax−Tw\)\)t≥Tw\.\\eta\(t\)=\\begin\{cases\}\\eta\_\{\\max\}t/T\_\{w\}&t<T\_\{w\},\\\\ \\frac\{\\eta\_\{\\max\}\}\{2\}\\\!\\left\(1\+\\cos\\\!\\left\(\\pi\\frac\{t\-T\_\{w\}\}\{T\_\{\\max\}\-T\_\{w\}\}\\right\)\\right\)&t\\geq T\_\{w\}\.\\end\{cases\}\(43\)Peak rateηmax=3×10−4\\eta\_\{\\max\}=3\\times 10^\{\-4\}\. Gradient checkpointing\[[22](https://arxiv.org/html/2607.01394#bib.bib22)\]reduces activation memory from𝒪​\(L​d\)\\mathcal\{O\}\(Ld\)to𝒪​\(L​d\)\\mathcal\{O\}\(\\sqrt\{L\}d\)at≈33%\\approx 33\\%additional forward compute\.

Under Chinchilla scaling\[[23](https://arxiv.org/html/2607.01394#bib.bib23)\], optimal training tokensD∗≈20​ND^\{\*\}\\approx 20Nfor parameter countNN\. Table[VIII](https://arxiv.org/html/2607.01394#S13.T8)gives projections for the Wiola family\.

TABLE VIII:Chinchilla\-Optimal Training Tokens and Projected PerplexityModelParamsD∗D^\{\*\}Proj\. PPLawiola\-120m120M2\.4B18–22wiola\-360m360M7\.2B13–17wiola\-700m700M14\.0B11–14wiola\-1\.5b1\.5B30\.0B9–12aWikiText\-103 projection, English text training\.

## XIVImplementation and Verification

Wiola registersmodel\_type = "wiola"with three HuggingFace AutoClasses:AutoConfig,AutoModelForCausalLM, andAutoTokenizer\. Weights are serialised insafetensorsformat for zero\-copy memory\-mapped loading\. Weight tying \(𝐖head=𝐄⊤\\mathbf\{W\}\_\{\\mathrm\{head\}\}=\\mathbf\{E\}^\{\\top\}\) saves 65\.5 MB for wiola\-360m\.

The tokenizer uses BPE\[[24](https://arxiv.org/html/2607.01394#bib.bib24)\]with byte\-level fallback \(NFC\-normalised Unicode pre\-tokenisation\), guaranteeing zero unknown tokens for any input\. The chat template encodes turns as:<\|user\|\>𝒰\\mathcal\{U\}<\|end\|\><\|assistant\|\>𝒜\\mathcal\{A\}<\|end\|\>\.

Table[IX](https://arxiv.org/html/2607.01394#S14.T9)summarises the test coverage; all 22 tests pass\.

TABLE IX:Unit Test Coverage \(All 22 Pass\)The incremental\-match test verifies that a full forward pass and a two\-chunk cached forward pass agree withℓ∞\\ell\_\{\\infty\}error below10−410^\{\-4\}\(BF16 precision bound\)\.

## XVDiscussion

SRPE vs\. extending RoPE\.YaRN\[[9](https://arxiv.org/html/2607.01394#bib.bib9)\]and LongRoPE reparameterise the same flat 2D circle\. SRPE’s secondary angleθj\(2\)\\theta\_\{j\}^\{\(2\)\}and radialrj​\(p\)r\_\{j\}\(p\)are absent from all RoPE variants—they encode hierarchical structure geometrically, not through learned weights\.

Mean\-pool summaries\.Learned pooling adds𝒪​\(d2\)\\mathcal\{O\}\(d^\{2\}\)parameters per layer\. Max\-pool discards magnitude\. Mean\-pool is parameter\-free, differentiable, and produces a vector in the same representation space as the token hidden states\. WithΛ=2\\Lambda=2, it balances context richness against propagating early\-layer noise\.

ATM in middle layers only\.Early layers build surface\-form features; merging would conflate distinct sub\-word tokens\. Final layers must operate on the full sequence for correct next\-token prediction\. Middle layers perform high\-level semantic integration where adjacent token redundancy is highest\.

Per\-dimension fusion\.A scalar blend would apply uniformly to all output dimensions\. The per\-dimension gate𝜶∈ℝd\\bm\{\\alpha\}\\in\\mathbb\{R\}^\{d\}lets the model choose, independently per output dimension and per token, whether to draw from the local or global stream\.

Limitations\.\(i\) ATM is disabled at inference to maintain KV\-cache consistency\. \(ii\) GCLA’s layer\-to\-layer dependency complicates pipeline parallelism\. \(iii\) SRPE’s radial term may exhibit phase interference forT\>8192T\>8192\. \(iv\) Full pre\-training benchmarks are left as future work\.

## XVIConclusion

We presented Wiola, a Small Language Model built from first principles with five novel architectural components\. SRPE embeds positions on a 3D helical manifold\. GCLA provides inter\-layer coherence via compressed layer summaries\. ATM reduces training FLOPs by 5–9% through dynamic token merging\. DSFF separates local and global feature extraction via parallel streams\. WiolaRMSNorm counteracts representation collapse with a per\-dimension learned offset\.

All five components are mathematically distinct from GPT\-2, LLaMA\-2, Mistral, Phi\-3, Falcon, and Gemma as demonstrated by the novelty matrix \(Table[V](https://arxiv.org/html/2607.01394#S12.T5)\)\. The implementation is production\-ready: 22 unit tests pass, four size variants \(120M–1\.5B\) are defined, and full HuggingFace integration is provided\. The KV\-cache footprint of wiola\-360m is 67 MB at 2048 tokens—4–6×\\timessmaller than comparable MHA models\.

Future work includes pre\-training at scale, instruction fine\-tuning via DPO, INT8/INT4 quantisation studies, and extensions of ATM to support inference\-time token merging with cache\-aware restoration\.

## Acknowledgment

The authors thank the PyTorch and HuggingFace open\-source communities\.

## References

- \[1\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin, “Attention is all you need,” inAdv\. Neural Inf\. Process\. Syst\., vol\. 30, 2017\.
- \[2\]T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, et al\., “Language models are few\-shot learners,” inAdv\. Neural Inf\. Process\. Syst\., vol\. 33, pp\. 1877–1901, 2020\.
- \[3\]A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever, “Language models are unsupervised multitask learners,”OpenAI Blog, vol\. 1, no\. 8, 2019\.
- \[4\]H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, et al\., “Llama 2: Open foundation and fine\-tuned chat models,”arXiv:2307\.09288, 2023\.
- \[5\]A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, et al\., “Mistral 7B,”arXiv:2310\.06825, 2023\.
- \[6\]J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu, “RoFormer: Enhanced transformer with rotary position embedding,”Neurocomputing, vol\. 568, p\. 127063, 2024\.
- \[7\]O\. Press, N\. A\. Smith, and M\. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” inProc\. ICLR, 2022\.
- \[8\]C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, et al\., “Exploring the limits of transfer learning with a unified text\-to\-text transformer,”J\. Mach\. Learn\. Res\., vol\. 21, no\. 140, pp\. 1–67, 2020\.
- \[9\]B\. Peng, E\. Quesnelle, H\. Fan, and E\. Shippole, “YaRN: Efficient context window extension of large language models,”arXiv:2309\.00071, 2023\.
- \[10\]J\. Ainslie, J\. Lee\-Thorp, M\. de Jong, Y\. Zemlyanskiy, F\. Lebrón, and S\. Sanghai, “GQA: Training generalised multi\-query transformer models from multi\-head checkpoints,” inProc\. EMNLP, pp\. 4895–4901, 2023\.
- \[11\]N\. Shazeer, “Fast transformer decoding: One write\-head is all you need,”arXiv:1911\.02150, 2019\.
- \[12\]N\. Shazeer, “GLU variants improve transformer,”arXiv:2002\.05202, 2020\.
- \[13\]D\. Hendrycks and K\. Gimpel, “Gaussian error linear units \(GELUs\),”arXiv:1606\.08415, 2016\.
- \[14\]S\. Elfwing, E\. Uchibe, and K\. Doya, “Sigmoid\-weighted linear units for neural network function approximation in reinforcement learning,”Neural Netw\., vol\. 107, pp\. 3–11, 2018\.
- \[15\]W\. Fedus, B\. Zoph, and N\. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”J\. Mach\. Learn\. Res\., vol\. 23, no\. 120, pp\. 1–39, 2022\.
- \[16\]D\. Bolya, C\.\-Y\. Fu, X\. Dai, P\. Zhang, C\. Feichtenhofer, and J\. Hoffman, “Token merging: Your ViT but faster,” inProc\. ICLR, 2023\.
- \[17\]B\. Zhang and R\. Sennrich, “Root mean square layer normalization,” inAdv\. Neural Inf\. Process\. Syst\., vol\. 32, 2019\.
- \[18\]J\. L\. Ba, J\. R\. Kiros, and G\. E\. Hinton, “Layer normalization,”arXiv:1607\.06450, 2016\.
- \[19\]Y\. Dong, J\.\-B\. Cordonnier, and A\. Loukas, “Attention is not all you need: Pure attention loses rank doubly exponentially with depth,” inProc\. ICML, pp\. 2793–2803, 2021\.
- \[20\]Y\. N\. Dauphin, A\. Fan, M\. Auli, and D\. Grangier, “Language modeling with gated convolutional networks,” inProc\. ICML, pp\. 933–941, 2017\.
- \[21\]I\. Loshchilov and F\. Hutter, “Decoupled weight decay regularization,” inProc\. ICLR, 2019\.
- \[22\]T\. Chen, B\. Xu, C\. Zhang, and C\. Guestrin, “Training deep nets with sublinear memory cost,”arXiv:1604\.06174, 2016\.
- \[23\]J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, et al\., “Training compute\-optimal large language models,” inAdv\. Neural Inf\. Process\. Syst\., vol\. 35, pp\. 30016–30030, 2022\.
- \[24\]R\. Sennrich, B\. Haddow, and A\. Birch, “Neural machine translation of rare words with subword units,” inProc\. ACL, pp\. 1715–1725, 2016\.
- \[25\]R\. Xiong, Y\. Yang, D\. He, K\. Zheng, S\. Zheng, C\. Xing, et al\., “On layer normalization in the transformer architecture,” inProc\. ICML, pp\. 10524–10533, 2020\.
- \[26\]M\. Abdin, J\. Aneja, H\. Awadalla, et al\., “Phi\-3 technical report: A highly capable language model locally on your phone,”arXiv:2404\.14219, 2024\.

Similar Articles

Little Brains, Big Feats: Exploring Compact Language Models

Hugging Face Daily Papers

This paper benchmarks 17 compact language models (1B-8B parameters) as generators in Russian-language RAG systems under CPU-only inference, finding that Qwen-family models offer strong quality-latency tradeoffs for private, GPU-free deployment.

Building Social World Models with Large Language Models

Hugging Face Daily Papers

The paper introduces the Social World Model (SWM) framework, which uses large language models to model the dynamics of social beliefs in response to events, without explicit annotations. It also presents a benchmark SWM-bench derived from prediction markets and shows state-of-the-art results.

Improved Large Language Diffusion Models

arXiv cs.CL

iLLaDA is an 8B parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.