
# TIDE: Every Layer Knows the Token Beneath the Context
Source: [https://arxiv.org/html/2605.06216](https://arxiv.org/html/2605.06216)
Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho (Apple)

(May 7, 2026)

###### Abstract

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This *single-injection assumption* induces two structural failures: (i) the *Rare Token Problem*, where the Zipf-type distribution of the vocabulary leaves rare-token embeddings chronically under-trained, since they receive only a fraction of the cumulative gradient signal of common tokens; and (ii) the *Contextual Collapse Problem*, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of $K$ independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We *theoretically* and *empirically* establish the benefits of TIDE in addressing the issues associated with single injection of token identity, as well as its *performance improvements* across multiple language modeling and downstream tasks.

## 1 Introduction

Scaling modern large language models (LLMs) involves devoting substantial representational capacity towards *contextualizing* tokens through innovating attention mechanisms, enlarging feed-forward modules, and stacking deep transformer layers. In contrast, a critical LLM component that has been widely overlooked in recent advancements is the *token index*: the only piece of information that unambiguously identifies what a token is. The token index is looked up once at the input embedding layer and then permanently discarded. Every subsequent computation across all $L$ transformer layers operates on a contextualized hidden state that never again directly consults which vocabulary entry is being processed. This single-injection assumption creates *two* distinct failure modes:

❶ The Rare Token Problem: Natural language vocabularies obey power-law scaling, specifically Zipf's law (zipf1949human; pilgrim2021bias): the most frequent 1% of tokens account for $\sim 80\%$ of corpus occurrences. Under SGD, the cumulative gradient signal for each token embedding is proportional to its frequency (Section [2.1](https://arxiv.org/html/2605.06216#S2.SS1.SSS0.Px1)), leaving rare-token embeddings (e.g. rare named entities, technical terms, and low-frequency morphological forms) persistently under-trained (Figure [1](https://arxiv.org/html/2605.06216#S1.F1)).

❷ Contextual Hidden State Collapse: During training, FFNs are forced into *representational overloading*: they simultaneously implement structural transformations of the residual stream and serve as the primary store of token-specific factual knowledge (meng2022rome; dai2022knowledgeneurons). The token index is never re-consulted at intermediate layers, and the only mechanism the FFNs have to differentiate two tokens at depth is the contextual mixture of residual and attention outputs. However, when two semantically distinct tokens appear in nearly identical syntactic environments, the context provides limited differentiating signal and their hidden states become nearly indistinguishable across the network (Figure [2](https://arxiv.org/html/2605.06216#S2.F2)).

Motivated by these challenges, we pose a critical question: *How can we provide every transformer layer with persistent, token-identity-conditioned knowledge, independent of and complementary to the contextual residual stream?* Unlike prior approaches that focus on post-hoc analysis of FFNs as de facto memories (geva2022vocabspace; meng2022rome; meng2023memit) or retrofit external retrieval at inference time (lewis2020rag; borgeaud2022retro; izacard2023atlas), we adopt an alternative approach: designing and training from scratch a novel transformer architecture that maintains a dedicated semantic memory indexed directly by static token identity information.

In this work, we propose TIDE (Token Identity Delivered Everywhere), an architectural modification to the standard transformer that maintains a dedicated semantic memory indexed directly by token identity (Figure [3](https://arxiv.org/html/2605.06216#S2.F3)). TIDE introduces EmbeddingMemory, an ensemble of $K$ independent MemoryBlocks, each mapping token indices to static, context-free learned semantic vectors that are injected into each transformer layer as a persistent, token-conditioned signal in parallel to the contextual residual stream. Our key contributions can be summarized as:

- **Architectural.** TIDE introduces a *token-level unified embedding memory* that enables $K$ disjoint pathways for token-level gradient accumulation. The memory embedding tensor is computed *once per forward pass* and injected into *every* transformer layer via a per-layer softmax routing mechanism conditioned on the post-attention hidden state.
- **Theoretical.** We formalize the two failure modes in the standard transformer and prove that TIDE (i) asymptotically generalizes the standard transformer; (ii) amplifies the per-token cumulative gradient signal by a factor of $K$; and (iii) routes around the FFN's Lipschitz constraint by exposing a discrete, token-indexed input with no continuity obligation to hidden states.
- **Empirical.** We empirically validate that TIDE significantly benefits *rare* tokens and mitigates the contextual collapse problem. Across model scales from 350M to 1B parameters, TIDE consistently delivers performance improvements over the standard transformer across various language modeling datasets (*e.g.,* Wikitext, PubMed, DCLM) as well as downstream tasks (*e.g.,* HellaSwag, ARC, PIQA).

![Refer to caption](https://arxiv.org/html/2605.06216v1/x1.png)Figure 1: Empirical Evidence that Rare Token Embeddings Remain Under-trained: (a) Mean embedding $\ell_2$-norm of the LLaMA-Base-1B pretrained checkpoint, showing a *monotonic increase* in norm from rare to common bins; (b) embedding norm distributions for rare and common tokens: the *wider rare* distribution versus the *narrow common* peak confirms that rare embeddings remain noise-dominated and under-trained; (c) bin-wise norm growth rate across intermediate training checkpoints per 50 billion tokens: rare-token norms exhibit a *monotonic decline* with continued training while common-token norms continue growing.

## 2 When Context is Not Enough: Diagnosing Standard Transformers

### 2.1 The Rare Token Problem.

##### Gradient Starvation Bottleneck:

Under minibatch SGD with batch size $B$, sequence length $T$, and per-token squared gradient norm bounded by $G^2$, the embedding $e_v\in\mathbb{R}^{d}$ for token $v$ receives a non-zero gradient only when $v$ appears in the current batch. In this setting, the expected cumulative squared gradient norm after $\tau$ training steps satisfies:

$$\mathbb{E}\left[\sum_{s=1}^{\tau}\left\lVert\nabla_{e_v}\mathcal{L}_s\right\rVert^2\right] \;\leq\; \tau\cdot f_v\cdot B\cdot T\cdot G^2, \tag{2.1}$$
where $f_v := \Pr[\text{uniformly drawn token position equals } v]\in(0,1)$ is the unigram probability of $v$, with $\sum_{v\in\mathcal{V}} f_v = 1$. Token $v$ is *rare* if $f_v = \epsilon$ for some $\epsilon\ll 1/(BT)$, and token $u$ is *common* if $f_u\geq c$ for some constant $c>0$ independent of $|\mathcal{V}|$. The full derivation of equation [2.1](https://arxiv.org/html/2605.06216#S2.E1) is given in Appendix [C](https://arxiv.org/html/2605.06216#A3).

| Tier | Bin(s) | Count | $f_v$ | $\mathbb{E}[N_v]$ |
|---|---|---|---|---|
| Hapax (rarest) | 0 | 1 | $8.3\times 10^{-9}$ | $\approx$ 1,660 |
| Near-hapax | 1 | $\sim$4 | $3.3\times 10^{-8}$ | $\approx$ 6,640 |
| Uncommon | 2 | $\sim$10 | $8.3\times 10^{-8}$ | $\approx$ 16,600 |
| Mid-freq. | 3–6 | $\sim 10^{2\text{–}3}$ | $\sim 10^{-6}$ | $\approx 10^{5\text{–}6}$ |
| Common (highest) | 7–9 | $\sim 10^{6}$ | $8.3\times 10^{-3}$ | $\approx 1.66\times 10^{9}$ |

Table 1: Expected non-zero gradient updates $\mathbb{E}[N_v]$ of token bins with 200B training tokens. In an example corpus of Wikitext-103 (merity2016pointer) tokenized using the LLaMA-3 tokenizer ($|\mathcal{V}| = 128{,}256$) to generate frequency bins (Appendix [B](https://arxiv.org/html/2605.06216#A2)), the gradient disparity between *rare* and *common* tokens becomes severe. Over a training budget of 200B tokens with $B=8$, $T=2048$, the expected number of non-zero gradient updates to token $v$'s embedding is:

$$\mathbb{E}[N_v] \;=\; \tau\bigl(1-(1-f_v)^{BT}\bigr) \;\approx\; \tau\cdot f_v\cdot BT \quad \text{for small } f_v. \tag{2.2}$$
With the frequency bins defined in Appendix [B](https://arxiv.org/html/2605.06216#A2), Table [1](https://arxiv.org/html/2605.06216#S2.T1) instantiates this across the 200B tokens in our training dataset, illustrating the severe disparity in gradient updates between rare and common tokens. Moreover, Figure [1](https://arxiv.org/html/2605.06216#S1.F1)(c) shows empirically that this disparity is not merely a cold-start artifact but grows *monotonically* as training progresses: the rare tokens' norms decline while common tokens' norms continuously increase.

Ratio of gradient signal for rare and common tokens: For rare $v$ ($f_v = \varepsilon$) and common $u$ ($f_u \geq c > 0$), let $G_{\min}^2 > 0$ be a lower bound on the per-step squared gradient norm conditioned on token $u$ appearing in a batch. The ratio of cumulative gradient signals satisfies:

$$\frac{\mathbb{E}\left[\sum_s\|\nabla_{e_v}\mathcal{L}_s\|^2\right]}{\mathbb{E}\left[\sum_s\|\nabla_{e_u}\mathcal{L}_s\|^2\right]} \;\leq\; \frac{\varepsilon\,BT\,G^2}{\kappa\,G_{\min}^2} \;=\; O(\varepsilon/c), \tag{2.3}$$
where $\kappa := 1-(1-c)^{BT} > 0$, with $BT$, $G^2$, and $G_{\min}^2$ fixed positive constants. The full derivation is given in Appendix [C.1](https://arxiv.org/html/2605.06216#A3.SS1). For the empirical instantiation in Table [1](https://arxiv.org/html/2605.06216#S2.T1), the ratio between rare tokens (Bin 1) and common tokens (Bin 9) is $\varepsilon/c \approx 10^{-6}$: a disparity of six orders of magnitude in gradient signal between rare and common tokens over the same training budget.
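As a quick sanity check, the short script below (our own, not part of the paper) reproduces the Table 1 arithmetic for $\mathbb{E}[N_v]$ from equation (2.2) and the $\varepsilon/c$ factor from equation (2.3), using the bin probabilities listed above.

```python
# Numeric sketch reproducing the Table 1 / Eq. (2.2)-(2.3) arithmetic.
B, T = 8, 2048                        # batch size and sequence length from Section 2.1
total_tokens = 200e9                  # 200B-token training budget
tau = total_tokens / (B * T)          # number of optimizer steps

def expected_updates(f_v: float) -> float:
    """E[N_v] = tau * (1 - (1 - f_v)^(B*T)), Eq. (2.2)."""
    return tau * (1.0 - (1.0 - f_v) ** (B * T))

f_hapax, f_common = 8.3e-9, 8.3e-3    # Bin-0 hapax vs. Bin-9 common unigram probabilities

print(f"E[N_v] hapax : {expected_updates(f_hapax):,.0f}")   # ~1,660 updates over the whole run
print(f"E[N_v] common: {expected_updates(f_common):,.0f}")  # ~1.2e7, i.e. essentially every step
print(f"eps / c      : {f_hapax / f_common:.1e}")           # 1.0e-06, the O(eps/c) factor in Eq. (2.3)
```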

![Refer to caption](https://arxiv.org/html/2605.06216v1/x2.png)Figure 2: Empirical Evidence of Contextual Collapse: Heatmap illustrating the mean $\ell_2$-distance $\|h^{(\ell)}_u - h^{(\ell)}_v\|$ between hidden states (LLaMA-Base-1B) of token pairs across 250 template sentences from three example categories of contextual collapse. For all sampled pairs, the distance remains *near-zero* for the majority of layers (except towards the end), confirming the presence of contextual collapse.

### 2.2 Contextual Collapse and the FFN's Blind Spot.

As mentioned before, the gradient starvation issue causes rare-token embeddings to converge to low-norm, noisy representations. More seriously, when two distinct tokens carry poorly trained embeddings of similar magnitude, a deeper structural failure arises: the hidden states produced for those tokens across all transformer layers may become indistinguishable, and this becomes even more problematic when the two tokens share similar contexts. We formalize this failure mode and show that it is an inherent consequence of the Lipschitz continuity imposed on any FFN by its continuous domain.

##### The Contextual Collapse Phenomenon:

At each layer $\ell$, the hidden state $h^{(\ell)}_v\in\mathbb{R}^d$ of a token $v$ is produced by the attention mechanism operating on the surrounding context. When two tokens $u\neq v$ appear in nearly identical syntactic environments, such as grammatical homophones (*their* vs. *there*), numeric identity tokens (*1847*, *1851*, or *1849*), or rare domain-specific synonyms (*ibuprofen* vs. *acetaminophen*), the context provides no distinguishing signal and attention therefore produces similar outputs for both.

We formally define this as:

###### Definition 2.1 (Contextual Collapse Set).

For a tolerance $\delta > 0$, the *contextual collapse set* at layer $\ell$ is formally defined as:

$$\mathcal{C}_\delta^{(\ell)} \;:=\; \bigl\{(u,v)\in\mathcal{V}^2 : u\neq v,\;\|h^{(\ell)}_u - h^{(\ell)}_v\|\leq\delta\bigr\},$$
where the hidden states are averaged over a representative corpus of contexts.

Figure [2](https://arxiv.org/html/2605.06216#S2.F2) provides direct empirical evidence of contextual collapse in the standard LLaMA-Base-1B model, estimated using 150 template sentences that differ only in the single token pair under consideration. For each of the three canonical example categories, the mean $\ell_2$ distance $\|h^{(\ell)}_u - h^{(\ell)}_v\|$ remains persistently small across the entire depth axis except for the last few layers, confirming the prevalence of collapse. Note that this phenomenon is most severe for the numerical token category, which shows notable collapse (small $\delta$) even in the final layer's hidden states.
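The following sketch illustrates how such a collapse probe can be implemented. It is our own minimal reconstruction, not the paper's evaluation code: the checkpoint name and the two toy templates are placeholders standing in for LLaMA-Base-1B and the 150-template protocol, and any Hugging Face causal LM that exposes hidden states would work.

```python
# Minimal layer-wise collapse probe in the spirit of Definition 2.1 / Figure 2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"   # placeholder stand-in for the paper's LLaMA-Base-1B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True).eval()

# Toy templates ending in the slot, so the differing token sits at the final position.
templates = ["She was born in the year {}", "The treaty was finally signed in {}"]

@torch.no_grad()
def layerwise_separation(tok_u: str, tok_v: str) -> torch.Tensor:
    """Mean ||h_u^(l) - h_v^(l)|| per layer, averaged over the templates."""
    dists = []
    for tmpl in templates:
        states = []
        for t in (tok_u, tok_v):
            ids = tok(tmpl.format(t), return_tensors="pt")
            hidden = model(**ids).hidden_states                     # (L+1) tensors of shape [1, T, d]
            states.append(torch.stack([h[0, -1] for h in hidden]))  # final-position state per layer
        dists.append((states[0] - states[1]).norm(dim=-1))
    return torch.stack(dists).mean(dim=0)                           # [L+1] separations, one per layer

print(layerwise_separation("1847", "1851"))                          # numeric-identity pair from Section 2.2
```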

###### Proposition 2.2 (FFN Approximation Lower Bound on Collapsed Tokens).

Let $(u,v)\in\mathcal{C}_\delta^{(\ell)}$ be a collapsed token pair and let $g:\mathcal{V}\to\mathbb{R}^d$ be any target function satisfying $\|g(u)-g(v)\| = C > 0$. Then for *any* choice of weights $W_1, W_2$:

$$\max\bigl\{\|\mathrm{FFN}(h_u)-g(u)\|,\;\|\mathrm{FFN}(h_v)-g(v)\|\bigr\} \;\geq\; \frac{C - L_{\mathrm{FFN}}\,\delta}{2}.$$
When $C > L_{\mathrm{FFN}}\,\delta$, the right-hand side is strictly positive: the FFN cannot approximate $g$ to arbitrary precision on the collapsed pair $(u,v)$, regardless of how many parameters it has.

Proof sketch. Since $(u,v)\in\mathcal{C}_\delta^{(\ell)}$, the Lipschitz bound forces $\|\mathrm{FFN}(h_u)-\mathrm{FFN}(h_v)\| \leq L_{\mathrm{FFN}}\delta$. Applying the triangle inequality to the target separation $C = \|g(u)-g(v)\|$ and substituting this bound yields $\|\mathrm{FFN}(h_u)-g(u)\| + \|\mathrm{FFN}(h_v)-g(v)\| \geq C - L_{\mathrm{FFN}}\delta$. Since the maximum of two non-negative terms is at least half their sum, the result follows. See Appendix [D](https://arxiv.org/html/2605.06216#A4) for details.

In this bound, $\delta$ is determined by the embeddings and attention layers; it is fixed before the FFN acts. The separation target $C$ is determined by the downstream task. The Lipschitz constant $L_{\mathrm{FFN}}$ is the only term the FFN controls, but it is bounded in practice because a large $L_{\mathrm{FFN}}$ amplifies *every* input perturbation, degrading performance on the majority of non-collapsed tokens. The bound exposes a structural limitation: given fixed upstream representations, no FFN, regardless of width, can resolve a collapsed token pair without destabilizing other inputs. The token index is injected once at the embedding layer and never reintroduced; unlike position, which is re-injected via RoPE at every attention layer, token identity has no recovery mechanism. Once intermediate layers erase the distinction, it is permanently lost to all subsequent computation.
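To make the bound concrete, the snippet below plugs illustrative constants of our own choosing into Proposition 2.2; none of these numbers come from the paper.

```python
# Illustrative check of the Proposition 2.2 lower bound with made-up constants:
# on a collapsed pair, the best achievable worst-case FFN error is (C - L_ffn * delta) / 2.
def ffn_error_lower_bound(C: float, L_ffn: float, delta: float) -> float:
    return max(C - L_ffn * delta, 0.0) / 2.0

print(ffn_error_lower_bound(C=1.0, L_ffn=10.0, delta=0.01))  # 0.45: irreducible error on the pair
print(ffn_error_lower_bound(C=1.0, L_ffn=10.0, delta=0.10))  # 0.0: bound becomes vacuous once L*delta >= C
```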

![Refer to caption](https://arxiv.org/html/2605.06216v1/x3.png)Figure 3: Main Architecture Diagram: TIDE augments standard transformers with a parallel, globally shared EmbeddingMemory module (red region) consisting of $K$ independent MemoryBlocks, each mapping raw token indices to a *context-free token identity signal*. Each layer uses a linear router to combine the memory block signals and injects the result into the residual stream additively.

## 3 TIDE: Token Identity Delivered Everywhere

In Section [2](https://arxiv.org/html/2605.06216#S2), we investigated and formalized two failure modes, *i.e.,* the rare token and contextual collapse problems, within the standard transformer architecture. In this work, we address these issues with a novel architectural modification: TIDE counters the single-injection assumption in the conventional design of modern LLMs. TIDE stops discarding the token identity information after the embedding layer and instead makes it directly accessible at every depth, so that each layer retains a token-discriminative signal independent of the contextual residual stream.

### 3.1 Preliminaries and Notations.

Let $\mathcal{V}$ denote a vocabulary of size $|\mathcal{V}|$, $d$ the model hidden dimension, $d_b$ the MemoryBlock embedding dimension, $K$ the number of MemoryBlocks, $L$ the number of transformer layers, $T$ the input sequence length, and $B$ the batch size. We use $x\in\mathbb{Z}^{B\times T}$ for a batch of token index sequences and $h^{(\ell)}\in\mathbb{R}^{B\times T\times d}$ for hidden states at layer $\ell$. The standard LLaMA-style transformer block at layer $\ell$ computes:

$$\tilde{h}^{\ell} = h^{\ell-1} + \operatorname{Attn}\bigl(\operatorname{RMSNorm}(h^{\ell-1})\bigr), \tag{3.1}$$
$$h^{\ell} = \tilde{h}^{\ell} + \operatorname{FFN}\bigl(\operatorname{RMSNorm}(\tilde{h}^{\ell})\bigr), \tag{3.2}$$
where $\operatorname{Attn}$ is multi-head self-attention with rotary position embeddings and $\operatorname{FFN}$ is a SiLU-gated feed-forward network. The primary embedding table $E\in\mathbb{R}^{|\mathcal{V}|\times d}$ maps each token index to an initial hidden state $h^{(0)} = E[x]$ that is then processed by the transformer blocks.

### 3.2 TIDE Architecture Design.

TIDE augments the standard transformer with a parallel *token-identity memory pathway* composed of three components:

MemoryBlocks: Each of the $K$ MemoryBlocks maintains a dedicated embedding table $E_k\in\mathbb{R}^{|\mathcal{V}|\times d_b}$ and maps a token index $v\in\mathcal{V}$ to a $d_b$-dimensional vector via a single embedding lookup followed by RMSNorm (zhang2019rmsnorm):

$$M_k(v) = \operatorname{RMSNorm}\bigl(E_k[v]\bigr)\in\mathbb{R}^{d_b}. \tag{3.3}$$
Each block maintains its own independent embedding table with no parameter sharing across blocks, encouraging each MemoryBlock to learn a distinct projection of the token identity space.
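A minimal PyTorch sketch of a single MemoryBlock as described by equation (3.3); the class name and dimensions are our own illustrative choices, and `nn.RMSNorm` assumes a recent PyTorch release (a hand-rolled RMSNorm can be substituted otherwise).

```python
# Sketch of one MemoryBlock: a token-indexed table followed by RMSNorm, Eq. (3.3).
import torch
import torch.nn as nn

class MemoryBlock(nn.Module):
    """One of the K independent token-identity tables: M_k(v) = RMSNorm(E_k[v])."""
    def __init__(self, vocab_size: int, d_b: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, d_b)   # E_k, no sharing across blocks
        self.norm = nn.RMSNorm(d_b)                  # requires PyTorch >= 2.4

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [B, T] integer indices -> [B, T, d_b] context-free identity vectors
        return self.norm(self.table(token_ids))
```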

EmbeddingMemory ensemble: The $K$ MemoryBlocks are stacked into a single memory tensor computed *once* per forward pass and shared across all $L$ transformer layers:

$$\mathbf{M} = \mathrm{Stack}_k\bigl(M_k(x)\bigr)\in\mathbb{R}^{B\times T\times K\times d_b}. \tag{3.4}$$
Depth-conditioned router and additive fusion: Within each transformer block, the post-attention normalised hidden state $\tilde{n}^{\ell} = \operatorname{RMSNorm}(\tilde{h}^{\ell})$ is fed to a lightweight linear router that generates the composition ratio $\alpha_k^{\ell}$ for the $k$-th memory block. We additionally introduce a *null bank* at slot $K{+}1$ satisfying $M_{K+1}(v) = \mathbf{0}$ for all $v$, giving the router a learned "off" switch with no dedicated parameters. The full TIDE layer update is:

$$\boldsymbol{\alpha}^{\ell} = \mathrm{softmax}\bigl(W_r^{\ell}\,\tilde{n}^{\ell}\bigr)\in\mathbb{R}^{K+1}, \tag{3.5}$$
$$m^{\ell}(v) = \sum_{k=1}^{K+1}\alpha_k^{\ell}\,M_k(v), \qquad h^{\ell} = \tilde{h}^{\ell} + \operatorname{FFN}\bigl(\tilde{n}^{\ell}\bigr) + m^{\ell}(v), \tag{3.6}$$
where $W_r^{\ell}\in\mathbb{R}^{(K+1)\times d}$ is a per-layer learned weight matrix and $\sum_{k=1}^{K+1}\alpha_k^{\ell} = 1$ with $\alpha_k^{\ell} > 0$ for all $k$. The memory vector $m^{\ell}(v)$ is added *additively* and *independently* of the FFN output: neither pathway interacts with the other, preserving the residual stream's role as a shared communication channel (elhage2021circuits). Since $\mathbf{M}$ is indexed by the discrete token identity $v$, not by the hidden state $h^{\ell}$, the memory contribution of each token is independent of contextual mixing at any depth.
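The sketch below assembles equations (3.4)-(3.6) into a per-layer fusion step. It is our own illustrative rendering rather than the released implementation, and it assumes $d_b = d$ so the routed memory vector can be added to the residual stream without a projection.

```python
# Hedged sketch of the depth-conditioned router and additive fusion, Eqs. (3.5)-(3.6).
import torch
import torch.nn as nn

class TIDEFusion(nn.Module):
    def __init__(self, d: int, num_blocks: int):
        super().__init__()
        self.router = nn.Linear(d, num_blocks + 1, bias=False)   # W_r^l, extra slot = null bank

    def forward(self, h_tilde: torch.Tensor, n_tilde: torch.Tensor,
                ffn_out: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # h_tilde, n_tilde, ffn_out: [B, T, d]; memory: [B, T, K, d] is the stacked
        # MemoryBlock tensor of Eq. (3.4), computed once and reused at every layer.
        B, T, K, d = memory.shape
        null = memory.new_zeros(B, T, 1, d)                       # null bank M_{K+1}(v) = 0
        banks = torch.cat([memory, null], dim=2)                  # [B, T, K+1, d]
        alpha = torch.softmax(self.router(n_tilde), dim=-1)       # [B, T, K+1], Eq. (3.5)
        m = (alpha.unsqueeze(-1) * banks).sum(dim=2)              # routed memory m^l(v)
        return h_tilde + ffn_out + m                              # additive fusion, Eq. (3.6)
```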

![Refer to caption](https://arxiv.org/html/2605.06216v1/x4.png)Figure 4: VRAM & SSD parameter breakdown across the LLaMA-Base-1B and TIDE-1B model family with varying MemoryBlock counts $K\in\{2,4,8,16,24\}$. Computational and Memory Overhead: In TIDE, each $M_k(v) = \mathrm{RMSNorm}(E_k[v])$ is a single embedding lookup followed by RMSNorm and contributes no matrix multiplications, so the per-layer overhead reduces to one $(K{+}1)$-way softmax router and a weighted sum of $d_b$-dimensional vectors. This is negligible relative to the baseline FFN. More importantly, every $E_k$ is indexed by the discrete token identity $v$, independent of $h^{\ell}$, so once training completes the EmbeddingMemory tables are static and can be 4-bit quantized (with negligible performance impact) and offloaded to SSD for on-demand asynchronous prefetch with an appropriate caching mechanism. As Figure [4](https://arxiv.org/html/2605.06216#S3.F4) shows, this keeps the effective VRAM footprint of TIDE at the LLaMA-Base-1B level (1.03 GB in 8-bit) while the SSD footprint scales from 0 to 3.152 GB as $K$ goes from 0 to 24. Additional details regarding inference overhead and MemoryBlock compression techniques can be found in Appendices [I](https://arxiv.org/html/2605.06216#A9) and [J](https://arxiv.org/html/2605.06216#A10).

### 3.3 TIDE: Theoretical Perspectives and Observations.

#### 3.3.1 Asymptotic Generalization to Standard Transformer.

###### Proposition 3.1 (Asymptotic Generalization).

Let $\mathcal{F}_{\mathrm{base}}$ denote the function class of standard transformers (equation [3.2](https://arxiv.org/html/2605.06216#S3.E2)) and $\mathcal{F}_{\mathrm{TIDE}}$ the class of our proposed TIDE models (equation [3.6](https://arxiv.org/html/2605.06216#S3.E6)). For any $\epsilon > 0$, there exist finite router parameters $W_r^{\ell}$ such that

$$\left\lVert m^{\ell}(v)\right\rVert < \epsilon \qquad \forall\, v\in\mathcal{V},\;\ell\in\{1,\ldots,L\}.$$
That is, $\mathcal{F}_{\mathrm{TIDE}}$ can approximate the standard transformer class $\mathcal{F}_{\mathrm{base}}$ to arbitrary precision.

Proof sketch. Since $M_{K+1}(v) = \mathbf{0}$, any weight assigned to the null bank contributes nothing to $m^{\ell}(v)$. By the softmax constraint, increasing the null logit $z_{K+1}^{\ell}$ jointly suppresses all active bank weights: $\sum_{k=1}^{K}\alpha_k^{\ell} = K/(K + e^{z_{K+1}^{\ell}}) \to 0$ as $z_{K+1}^{\ell}\to\infty$.

The triangle inequality then gives $\left\lVert m^{\ell}(v)\right\rVert \leq (1-\alpha_{K+1}^{\ell})\cdot C \to 0$, where $C = \max_{v,k}\left\lVert M_k(v)\right\rVert < \infty$. Setting $z_{K+1}^{\ell} = s^* = \log\bigl(K(C-\epsilon)/\epsilon\bigr)$ achieves $\left\lVert m^{\ell}(v)\right\rVert < \epsilon$ at a finite parameter configuration. The full proof can be found in Appendix [E](https://arxiv.org/html/2605.06216#A5).
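A small numeric check of this construction, with our own illustrative constants and the same simplifying assumption as the sketch above (the $K$ active banks share a logit of zero):

```python
# Numeric check of the null-bank logit s* from the Proposition 3.1 proof sketch.
import math

K, C, eps = 8, 4.0, 1e-3                # illustrative: block count, max memory norm, target epsilon
s_star = math.log(K * (C - eps) / eps)  # the finite null logit from the sketch

def memory_norm_bound(z_null: float) -> float:
    """Bound (1 - alpha_null) * C on ||m^l(v)|| when active logits are 0 and the null logit is z_null."""
    active_mass = K / (K + math.exp(z_null))
    return active_mass * C

print(memory_norm_bound(s_star))        # ~= eps: the target precision is reached at a finite logit
print(memory_norm_bound(s_star + 5.0))  # shrinks further as the null logit grows
```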

#### 3.3.2 TIDE's $K$-Pathway Gradient Amplification.

In Section [2.1](https://arxiv.org/html/2605.06216#S2.SS1.SSS0.Px1), we discussed that in the standard transformer the embedding $e_v$ of a rare token $v$ receives a non-zero gradient update only in steps where $v$ appears in the batch, yielding an expected cumulative squared gradient norm bounded by $\tau\cdot f_v\cdot BT\cdot G^2$. TIDE's architecture provides a design advantage: its $K$ independent MemoryBlocks open $K$ distinct, parallel gradient pathways into each token's embedding tables on *every* training step in which the token appears, regardless of how rarely it occurs in the corpus. We formalize this advantage as:

###### Proposition 3.2 ($K$-Pathway Gradient Amplification).

Let $\mathcal{L}_s$ denote the loss at step $s$ and let $e^{(k)}_v\in\mathbb{R}^{d_b}$ be the embedding of token $v$ in MemoryBlock $k$. Under minibatch SGD, the total expected cumulative squared gradient norm across all $K$ embedding tables for token $v$ satisfies:

$$\mathbb{E}\left[\sum_{s=1}^{\tau}\sum_{k=1}^{K}\left\lVert\nabla_{e^{(k)}_v}\mathcal{L}_s\right\rVert^2\right] \;\geq\; K\cdot\tau\cdot\kappa_v\cdot G^2_{\min}, \tag{3.7}$$
where $\kappa_v = 1-(1-f_v)^{BT} \approx f_v\cdot BT$ for small $f_v$, and $G^2_{\min} > 0$ is a lower bound on the per-step squared gradient norm conditioned on token $v$ appearing in the batch. Consequently, TIDE provides a $K$-fold amplification of gradient signal relative to the standard single-embedding baseline.

Proof sketch. Each MemoryBlock $k$ maintains an independent embedding table $E_k\in\mathbb{R}^{|\mathcal{V}|\times d_b}$ with no parameter sharing across blocks (for simplicity, we state the argument for a router over the $K$ active banks). Within a forward pass during training, MemoryBlock $k$'s output $M_k(v)$ is injected into every transformer layer $\ell$ via the routing weight $\alpha^{\ell}_k$, contributing to the residual stream and thereby to the loss. Since the $K$ blocks are independent, the event $\{v\in\mathrm{batch}_s\}$ triggers gradient flow through all $K$ embedding tables simultaneously. Because router weights are strictly positive for finite logits, each table receives a non-degenerate gradient on every step in which $v$ appears. Summing across blocks and applying the lower bound from Appendix [C.1](https://arxiv.org/html/2605.06216#A3.SS1) independently to each yields the $K$-fold amplification. See Appendix [F](https://arxiv.org/html/2605.06216#A6) for details.

![Refer to caption](https://arxiv.org/html/2605.06216v1/x5.png)Figure 5: Mean validation cross-entropy loss per frequency decile of LLaMA-Base-1B and TIDE-8E-1B trained with 200B tokens. TIDE strictly improves over the baseline on every decile, with gains concentrated on *rare* tokens and following a monotonically decreasing trend as *rare* > *mid* > *common*. ★ Empirical Investigation [Rare Tokens Benefit from TIDE]: Figure [5](https://arxiv.org/html/2605.06216#S3.F5)(a) illustrates the mean cross-entropy of LLaMA-Base-1B and TIDE-8E-1B at the matched 200B-token training budget across all 10 token frequency deciles. TIDE *strictly outperforms* LLaMA-Base-1B on every decile, but the absolute performance gap is sharply asymmetric between *rare* and *common* tokens. The per-decile loss reduction in Figure [5](https://arxiv.org/html/2605.06216#S3.F5)(b) decays *monotonically* from 0.704 nats (9.0% relative) on the rarest decile to 0.068 nats (2.4%) on the most frequent decile, yielding a $\sim 4.8\times$ disparity in absolute gain between the *rare* and *common* means. This rare-skewed improvement profile is precisely the empirical signature of the $K$-fold gradient amplification assisting tokens whose base embedding $E$ is gradient-starved during training.

#### 3.3.3 Contextual Collapse and TIDE's $K$ MemoryBlocks.

In a standard transformer, the FFN receives $h^{(\ell)}$ as input, and when $\left\lVert h^{(\ell)}_u - h^{(\ell)}_v\right\rVert\leq\delta$ is small, Lipschitz continuity forces its outputs to remain close regardless of the weights chosen (see Section [2.2](https://arxiv.org/html/2605.06216#S2.SS2)). TIDE's architectural design permits breaking this constraint: each MemoryBlock is indexed by the discrete token identity $v$ rather than by $h^{(\ell)}$, so its output carries no continuity obligation with respect to $\delta$. We formalize this observation as:

###### Proposition 3.3 (Memory Ensemble Resolves Collapsed Token Separation).

Let $(u,v)\in\mathcal{C}_\delta^{(\ell)}$ be a collapsed token pair satisfying $\left\lVert h^{(\ell)}_u - h^{(\ell)}_v\right\rVert\leq\delta$, and let $C > 0$ be any target separation. For any $K\geq 1$, there exist EmbeddingMemory parameters $\{E_k\}_{k=1}^{K}$ such that:

$$\left\lVert M_k(u) - M_k(v)\right\rVert = C \tag{3.8}$$
regardless of $\delta = \left\lVert h^{(\ell)}_u - h^{(\ell)}_v\right\rVert$ and independently of $L_{\mathrm{FFN}}$.

Proof sketch. Each MemoryBlock output is $M_k(v) = \mathrm{RMSNorm}(E_k[v])$, where $E_k[v]$ is the row of embedding table $E_k$ indexed by the discrete token identity $v$. The hidden state $h^{(\ell)}$ does not appear in this computation, so $M_k(u)$ and $M_k(v)$ depend only on their respective rows $E_k[u]$ and $E_k[v]$. Since these rows are *separate, uncoupled parameters*, they can be assigned freely and independently for any token pair $(u,v)$, regardless of how small $\delta$ is. In particular, one can choose $E_k[u]$ and $E_k[v]$ such that the resulting RMSNorm outputs achieve any prescribed separation $C > 0$, which satisfies equation [3.8](https://arxiv.org/html/2605.06216#S3.E8). See Appendix [G](https://arxiv.org/html/2605.06216#A7) for additional details.

We would like to clarify that TIDE does not attempt to fight the Lipschitz constraint of the FFN; it routes around it by exploiting a fundamentally different input signal during training. Because $m^{(\ell)}(v) = \sum_{k=1}^{K}\alpha^{\ell}_k M_k(v)$ is re-injected additively at every transformer layer via independent per-layer router weights $W^{\ell}_r$, this token-discriminative signal persists throughout the residual stream and enables effective separation at every layer $\ell$.

![Refer to caption](https://arxiv.org/html/2605.06216v1/x6.png)Figure 6: Layer-wise $\ell_2$ separation $\|h^{(\ell)}_u - h^{(\ell)}_v\|$ between hidden states of token pairs from the three example contextual collapse categories, averaged across 150 template sentences. ★ Empirical Investigation [Contextual Collapse is Moderated by TIDE]: To empirically validate the contribution of the additive MemoryBlock pathways, we revisit the three example contextual collapse categories from Figure [2](https://arxiv.org/html/2605.06216#S2.F2) (grammatical homophones, numeric identity tokens, rare domain tokens) and compare the layer-wise $\ell_2$ separation $\|h^{(\ell)}_u - h^{(\ell)}_v\|$ between LLaMA-Base-1B and TIDE on the same template sentences. Figure [6](https://arxiv.org/html/2605.06216#S3.F6) (top row) reports the mean $\ell_2$ norm averaged over all sampled token pairs in each category, and the bottom row reports the per-layer difference $\Delta = \|\cdot\|_{\mathrm{TIDE}} - \|\cdot\|_{\mathrm{Base}}$. Across all three categories, TIDE's token-discriminative signal injection significantly increases the $\ell_2$ separation, most prominently from the middle to terminal layers, which are distant from the base embedding $E$. Note that *numerical tokens*, which suffer acute collapse (Figure [2](https://arxiv.org/html/2605.06216#S2.F2)), are the predominant beneficiary of token identity injection throughout all layers.

Table 2: Benchmark results for LLaMA-Base and TIDE variants at 750M, 1B and 3B parameter scales. PPL is LAMBADA perplexity (lower is better); BoolQ and LAMBADA use accuracy; all other columns use normalized accuracy (%). Best results per scale are **bolded**.

| Model | PPL ↓ | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | OBQA | PIQA | SciQ | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| **750M Parameters** | | | | | | | | | | |
| LLaMA-Base | 5.63 | 34.6 | 60.4 | **63.5** | 60.9 | 62.8 | 36.8 | 73.8 | 85.1 | 59.7 |
| TIDE-8E-750M | **5.18** | **36.0** | **61.4** | 63.0 | **62.6** | **64.9** | **37.2** | **74.8** | **85.8** | **60.7** |
| **1B Parameters** | | | | | | | | | | |
| LLaMA-Base | 5.19 | 37.5 | 64.4 | 61.7 | 63.9 | 64.6 | 37.6 | 74.9 | 86.9 | 61.4 |
| TIDE-2E-1B | 4.97 | 37.6 | 65.7 | 68.7 | 64.9 | 65.1 | 36.4 | 75.5 | 87.1 | 62.6 |
| TIDE-8E-1B | 4.89 | 37.5 | 64.5 | 69.3 | 65.3 | 64.7 | **40.8** | 75.5 | 86.6 | 63.0 |
| TIDE-16E-1B | 4.78 | 38.7 | 65.5 | **69.7** | 65.3 | 65.7 | 37.8 | 75.9 | **87.9** | 63.3 |
| TIDE-24E-1B | **4.60** | **38.9** | **66.3** | 69.5 | **66.3** | **66.4** | 37.2 | **77.3** | 87.2 | **63.7** |
| **3B Parameters** | | | | | | | | | | |
| LLaMA-Base | 4.00 | 41.2 | 74.8 | 69.0 | 71.9 | 69.4 | 40.2 | 78.1 | **93.3** | 67.2 |
| TIDE-8E-3B | **3.86** | **44.3** | **75.5** | **72.3** | **72.2** | **70.2** | **40.6** | **78.3** | 93.2 | **68.3** |

![Refer to caption](https://arxiv.org/html/2605.06216v1/x7.png)Figure 7: Mean cross-entropy loss across *rare, mid,* and *common* tokens with an increasing number $K$ of MemoryBlocks.

## 4 Experiments and Ablation Studies

### 4.1 Performance Benchmarking of TIDE and the Standard Transformer.

➢ Perplexity and Training Dynamics: TIDE introduces a parallel additive EmbeddingMemory pathway within conventional transformers to address the challenges associated with rare tokens and contextual collapse (Sections [3.3.3](https://arxiv.org/html/2605.06216#S3.SS3.SSS3) and [3.3.2](https://arxiv.org/html/2605.06216#S3.SS3.SSS2)). Here, we first investigate the token-indexed memory's ability to improve the language modeling quality of standard transformers. Figure [8](https://arxiv.org/html/2605.06216#S4.F8) presents the validation perplexity on three held-out corpora, Wikitext (merity2016pointer), PubMed (jin2019pubmedqa), and DCLM (li2024datacomp), as a function of total training tokens for LLaMA-Base-1B and TIDE-1B with $K\in\{2,4,8,16,24\}$. Each TIDE variant *strictly outperforms* LLaMA-Base-1B, improving monotonically from $K=2$ to $K=24$ without saturation. The performance gap opens early in training: at 100B tokens, TIDE with merely 2-4 MemoryBlocks already matches the perplexity the baseline reaches with 200B tokens, indicating that the additional gradient pathways translate to faster effective convergence.

![Refer to caption](https://arxiv.org/html/2605.06216v1/x8.png)Figure 8: Wikitext-2, PubMed, and DCLM validation PPL as a function of training tokens, indicating monotonic improvement with increasing $K$ across TIDE variants. ➢ Influence of $K$ across *Rare, Mid, and Common* tokens: While the perplexity-based evaluation shows the overall performance benefit of TIDE, a natural question arises from Proposition [3.2](https://arxiv.org/html/2605.06216#S3.Thmtheorem2): do these $K$ MemoryBlock pathways empirically benefit rare and common tokens equally, and does the marginal benefit of additional blocks scale with token frequency? Figure [7](https://arxiv.org/html/2605.06216#S3.F7) decomposes held-out cross-entropy loss across the *rare, mid, and common* bins as a function of increasing $K$. The absolute loss reduction over LLaMA-Base-1B is largest on rare tokens: moving from $K=0$ to $K=24$ reduces rare-bin loss from 6.671 to 6.250 nats ($-0.421$), compared to only $-0.075$ nats on the common bin, a 5.6$\times$ difference in absolute gain. In addition, the per-block marginal benefit (the slope of each curve) is 3.7$\times$ steeper on rare tokens than on common tokens, in line with Section [3.3.2](https://arxiv.org/html/2605.06216#S3.SS3.SSS2). Note that even the smallest configuration ($K=2$) delivers $\sim$55% of the total rare-token improvement obtained at $K=24$ (also reflected in the PPL curves in Figure [8](https://arxiv.org/html/2605.06216#S4.F8)), suggesting that the bulk of the benefit can be achieved with a modest 2-4 memory blocks.

➢ TIDE and Downstream Task Performance: Table [2](https://arxiv.org/html/2605.06216#S3.T2) reports zero-shot accuracy on a suite of eight benchmarks (ARC-C, ARC-E, BoolQ, HellaSwag, LAMBADA, OBQA, PIQA, SciQ) across 750M, 1B, and 3B parameter scales of LLaMA-Base and TIDE variants. Across all settings, TIDE variants consistently outperform the standard transformer baselines, confirming the robustness of our proposed architecture. More specifically, at the 1B scale, TIDE improves the average score from 61.4 (Base) to 63.7 ($K=24$), a +2.3% absolute gain, with monotonic improvement in $K$ on the perplexity column and on six of the eight downstream tasks.

### 4.2 A Deeper Investigation of MemoryBlocks and the NULL Bank.

The performance results in Section [4.1](https://arxiv.org/html/2605.06216#S4.SS1) establish that TIDE's $K$ pathways provide an informative signal beyond the contextual residual stream. We now turn the investigation inward to understand the information stored across MemoryBlocks and the per-layer router dynamics after training.

![Refer to caption](https://arxiv.org/html/2605.06216v1/x9.png)Figure 9: Mean cosine distance between the primary embedding $E$ and the 8 MemoryBlocks. ➢ Distance between the Primary Embedding and MemoryBlocks: We first investigate whether TIDE's $K$ MemoryBlocks converge to substantially distinct subspaces or collapse to replicate the base embedding $E$. Figure [9](https://arxiv.org/html/2605.06216#S4.F9)(a) reports the mean cosine distance ($1-\tfrac{1}{|\mathcal{V}|}\sum_{v}\cos(A[v],B[v])$) between every pair of embedding tables in TIDE-8E-1B. We make two key observations: (a) every $M_k$, without any explicit diversity loss, is highly distant from $E$ (mean cosine distance to $E$ ranges from 0.65 to 0.99), confirming that MemoryBlocks do not replicate the input-embedding subspace but encode a complementary token-identity signal; (b) the inter-$M_k$ distances are relatively smaller, indicating that the $K$ blocks converge to overlapping but non-collapsed subspaces.

➢ Bin-wise Router Statistics for MemoryBlocks and the NULL Bank: In reference to Proposition [3.1](https://arxiv.org/html/2605.06216#S3.Thmtheorem1), which states that TIDE asymptotically generalizes the standard transformer through the NULL bank, an empirical question persists: how does the router actually utilize this NULL bank, and does it do so in a token-aware manner? Figure [10](https://arxiv.org/html/2605.06216#S4.F10) reports the mean routing weight $\bar{\alpha}_k$ allocated to each memory block $M_k$ and to the NULL bank at the last layer, stratified by frequency bin. We highlight two key observations: (a) the NULL-bank weight is *monotonically non-decreasing in token frequency*: it rises from $\bar{\alpha}_{\mathrm{NULL}} = 0.530$ on the rarest decile to 0.889 on the most common decile. The router has therefore learned to open the gate and admit substantial memory-bank mass ($1-\bar{\alpha}_{\mathrm{NULL}} \approx 0.47$) for the rarest tokens compared to common tokens; (b) the router weight is non-uniform across blocks: $M_5$ carries an outsized share on rare tokens ($\bar{\alpha}_5 \approx 0.31$) before collapsing to near-zero on common tokens, while $M_4$ specializes in mid-decile tokens, illustrating that distinct banks specialize to distinct frequency regimes *rather than redundantly co-firing*.
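A sketch of how such a bin-wise router statistic can be computed; the helper name and tensor shapes are our own assumptions, not the authors' analysis code.

```python
# Bin-wise mean routing weights in the spirit of Figure 10.
import torch

def binwise_router_means(alpha: torch.Tensor, token_bins: torch.Tensor, num_bins: int = 10) -> torch.Tensor:
    """alpha: [N, K+1] last-layer routing weights per token position (last column = NULL bank);
    token_bins: [N] frequency-bin index of each position's token.
    Returns [num_bins, K+1] mean routing weight per frequency bin."""
    means = torch.zeros(num_bins, alpha.shape[1])
    for b in range(num_bins):
        mask = token_bins == b
        if mask.any():
            means[b] = alpha[mask].mean(dim=0)
    return means
```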

![Refer to caption](https://arxiv.org/html/2605.06216v1/x10.png)
![Refer to caption](https://arxiv.org/html/2605.06216v1/x11.png)

Figure 10: Bin-wise mean router weights $\bar{\alpha}_k$ across MemoryBlocks (left) and the NULL bank (right), stratified by token frequency decile.

## 5 Conclusion

In this work, we propose TIDE, a transformer architecture that addresses two empirically established failure modes of standard LLMs: gradient starvation of rare tokens and contextual collapse of semantically distinct tokens. We introduce EmbeddingMemory, an ensemble of $K$ independent MemoryBlocks that map token indices directly to semantic vectors, injected at every layer via a depth-conditioned router with a NULL bank. TIDE provides each transformer layer with a persistent, token-specific signal that is immune to contextual collapse by construction. We *theoretically* and *empirically* establish the benefits of TIDE in addressing the issues associated with single injection of token identity. With extensive experiments across different model scales, we find that TIDE consistently *improves performance* across multiple language modeling and downstream tasks.

## References

## Appendix A Background Work

### A.1 Memory Augmented Architectures.

Memory-augmented models are designed to expand a model's effective parameter space without incurring large computational overhead. Early work on memory networks was introduced by weston2014memory and later extended with fully end-to-end trainable variants by sukhbaatar2015end. Neural Turing Machines (graves2014neural; Graves2016HybridCU) incorporate an external, trainable memory that works alongside other neural components to simulate a differentiable, trainable computing system. Product-key networks (lample2019large) improve the efficiency and scalability of memory retrieval, proposing a key-value memory layer that can scale to very large sizes while keeping exact search on the key space. More recently, PEER (he2024mixture) has advanced these ideas by replacing traditional vector-based memory values with rank-one matrices, linking memory-augmented architectures with mixture-of-experts models.

Accurate factual generation remains a critical objective for generative models, often evaluated using open-domain question answering benchmarks (chen2017reading; chen-yih-2020-open) and other tasks requiring substantial knowledge (petroni2021kilt). Models that can effectively encode factual knowledge from training data are better equipped to provide correct responses to knowledge-intensive queries. While larger models generally demonstrate improved factual accuracy (roberts2020much; brown2020language), hallucination remains a persistent challenge. One effective approach for mitigating this issue is retrieval-augmented generation, which leverages external knowledge sources to improve factual consistency (lewis2020retrieval; karpukhin2020dense; khandelwal2019generalization). Several language models have incorporated text retrieval from the pretraining stage. REALM (guu2020retrieval) augments a BERT model with one retrieval step to solve QA tasks. Retro (borgeaud2022improving) enhances auto-regressive decoding with multiple rounds of retrieval, once per 64 tokens; the retrieved texts are injected through a two-layer encoder and then several cross-attention layers in the decoder. Retro++ (wang2023shall) explores the scalability of Retro by reproducing it up to 9.5B parameters. Meanwhile, several models are adapted to retrieval in the finetuning stage. WebGPT (nakano2021webgpt) learns to use a search engine through imitation learning in a text-based web-browsing environment. Toolformer (schick2023toolformer) performs decoding with multiple tools including a search engine, with the finetuning data labeled by the language model itself.

### A.2 Understanding Feed-Forward Networks in Transformers.

Several studies have investigated the role of feed-forward networks (FFNs) in transformers, particularly their contribution to storing and retrieving knowledge learned during pretraining. geva2021transformer demonstrated that FFNs can be interpreted as key-value memories that activate on specific lexical or semantic patterns, while follow-up work showed that FFNs promote vocabulary-level concepts during prediction (geva2022transformer2). Additional related analyses in embedding space further explored how FFN activations correspond to linguistic features and factual recall (dar2023analyzing; nichani2024understanding). Within their framework, the first layer acts as a pattern detector ("keys") while the second layer projects specific information into the residual stream ("values"). This modularity is evidenced by the identification of specific "knowledge neurons" responsible for storing distinct facts. More broadly, the interpretation of neural networks as associative or persistent memory systems connects this line of work to earlier memory-augmented architectures (sukhbaatar2019augmenting). However, these analyses rely on contextualized residual activations and require extensive post-hoc mining of calibration data, making the inferred query space indirect and difficult to interpret. Furthermore, since FFNs operate exclusively on the contextualized residual stream, their ability to distinguish tokens is mathematically bottlenecked when distinct tokens appear in identical syntactic contexts, which leads to the contextual collapse problem. Recently, MoLE (jie2025mixture) illustrates that in mixture-of-experts (MoE) models, the majority of experts can be trained directly with token-level input embeddings. Following this static routing concept, MemoryLLM (jaiswal2026memoryllm) completely decouples FFNs from the contextual residual stream by directly training a layer-local, token-indexed embedding table to enhance interpretability and reduce compute. Concurrently, in the STEM (sadhukhan2026stem) architecture, the FFN is partially replaced by an embedding table, with the substitution occurring at the up-projection layer. TIDE builds upon this token-level intuition but fundamentally diverges from standard FFNs: instead of relying on contextual mixtures vulnerable to collapse, TIDE bypasses them entirely by injecting a context-free token identity directly into the residual stream at every depth.

### A.3 Advancements with Embeddings and Modern LLMs.

Standard transformer models rely on a single-injection assumption where token embeddings are looked up once at the input layer and subsequently fade out. Since language vocabularies strictly obey Zipf's law (zipf1949human; pilgrim2021bias), the majority of tokens appear infrequently in the training corpus. Sub-word tokenization (sennrich2016bpe) was introduced to mitigate the out-of-vocabulary issue, yet it does not resolve the fundamental long-tail distribution of tokens, which continues to degrade the performance of contextualized embeddings on rare words. Under standard stochastic gradient descent, this skewed distribution leads to gradient starvation for rare tokens. Embedding sharing (inan2017tying; ofir2017tying) attempts to stabilize embedding training by tying input and output embedding weights, allowing input representations to benefit directly from the richer gradient signal of the pre-softmax layer. However, simply sharing parameters between the input and output layers does not structurally resolve the gradient starvation on low-frequency tokens. TIDE directly addresses this gradient starvation bottleneck by utilizing independent memory blocks that amplify the gradient signal to token representations, disproportionately benefiting rare tokens.

## Appendix B Details of Frequency Bins Generated from Vocabulary

The WikiText-103 training split is tokenized with the LLaMA-3 tokenizer ($|\mathcal{V}| = 128{,}256$, sequence length $T = 2{,}048$), producing a token stream of $\sim$120M tokens over $\sim$58k sequences, of which 65,569 vocabulary entries appear at least once. Raw occurrence counts are then passed through a structural filter that removes BOS/EOS special tokens, pure-whitespace tokens, and non-alphanumeric punctuation. In total, 28 tokens (0.04% of observed types) are removed, leaving 65,541 token types for the binning procedure.

Table 3: Frequency decile bin reference for WikiText-103 with the LLaMA-3 BPE tokenizer (128K vocabulary) after structural filtering. Each bin contains $\approx 6{,}554$ token types ranked by corpus frequency. Representative example tokens are drawn from each tier for semantic illustration.

| Bin | Freq. range | Types | Description | Role |
|---|---|---|---|---|
| 0 | 1–2 | 6,555 | Hapax & near-hapax tokens | Rare |
| 1 | 2–6 | 6,554 | Domain-specific, rare names | Rare |
| 2 | 6–20 | 6,554 | Uncommon words, rare entities | Rare |
| 3 | 20–61 | 6,554 | Infrequent content words | Mid |
| 4 | 61–133 | 6,554 | Occasional content words | Mid |
| 5 | 133–240 | 6,554 | Moderate-frequency words | Mid |
| 6 | 240–416 | 6,554 | Fairly common content words | Mid |
| 7 | 416–769 | 6,554 | Common function + content words | Common |
| 8 | 769–1,856 | 6,554 | High-frequency content words | Common |
| 9 | 1,856–999,999 | 6,554 | Highest-frequency (below cap) | Common |

Rare examples (Bins 0–2): *cefuroxime*, *morgan*, *Produto*, *Teotihuacan*, *toujours*. Mid examples (Bins 3–6): *volcano*, *diocese*, *battalion*, *peninsula*, *sculptor*, *harbour*. Common examples (Bins 7–9): *there*, *their*, *also*, *however*, *first*, *time*.

##### Decile assignment of Token Types:

The 65,541 cleaned types are sorted by ascending corpus frequency and partitioned into $B=10$ equal-cardinality bins, each containing $\approx 6{,}554$ types, with the bin index assigned as:

$$b(v) \;=\; \min\left(\left\lfloor\frac{\mathrm{rank}(v)}{|\mathcal{V}_{\mathrm{clean}}|}\cdot B\right\rfloor,\; B-1\right). \tag{B.1}$$

Bin assignment is determined by rank alone, while the absolute frequencies establish the ordering. Crucially, while every bin contains the same number of token *types*, the bins account for vastly different shares of the training token *stream* under Zipf's law. Throughout this paper, Bins {0–2} are referred to as *rare*, Bins {3–6} as *mid-frequency*, and Bins {7–9} as *common* tokens.
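A short sketch of the decile assignment of equation (B.1), assuming the filtered type counts are already available as a `Counter`; the toy counts below are made up for illustration.

```python
# Decile assignment of token types, Eq. (B.1).
from collections import Counter

def assign_bins(counts: Counter, num_bins: int = 10) -> dict[str, int]:
    """Map each token type to a frequency decile: rank by ascending count, then bucket by rank."""
    ranked = sorted(counts, key=counts.get)          # ascending corpus frequency
    n = len(ranked)
    return {tok: min(rank * num_bins // n, num_bins - 1) for rank, tok in enumerate(ranked)}

# Toy usage with made-up counts; in the paper the counts come from WikiText-103
# tokenized with the LLaMA-3 tokenizer and structurally filtered.
counts = Counter({"cefuroxime": 1, "volcano": 80, "their": 5000, "there": 4800, "time": 6000})
print(assign_bins(counts))   # rare words land in low bins, frequent words in high bins
```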

## Appendix C Gradient Starvation Bound: Derivation of Equation ([2.1](https://arxiv.org/html/2605.06216#S2.E1))

The loss $\mathcal{L}_s$ depends on $e_v$ only through positions in $\mathrm{batch}_s$ that equal $v$. If $v\notin\mathrm{batch}_s$, then $\partial\mathcal{L}_s/\partial e_v = \mathbf{0}$ exactly. Formally, we can write:

$$\nabla_{e_v}\mathcal{L}_s = \mathbf{0} \quad \text{whenever } v\notin\mathrm{batch}_s. \tag{C.1}$$

Let us define $X_s := \mathbf{1}[v\in\mathrm{batch}_s]$. Since each of the $BT$ positions is drawn i.i.d. from $\{f_v\}$, the event $\{v\notin\mathrm{batch}_s\}$ requires all $BT$ draws to avoid $v$, each with probability $(1-f_v)$. Therefore:

$$\Pr[v\in\mathrm{batch}_s] \;=\; \mathbb{E}[X_s] \;=\; 1-(1-f_v)^{BT} \;\leq\; f_v\cdot B\cdot T, \tag{C.2}$$
where the inequality applies the Bernoulli bound $(1-f_v)^{BT} \geq 1 - f_v BT$, valid for $f_v\in[0,1]$.

Next, we bound the cumulative squared gradient using equation [C.1](https://arxiv.org/html/2605.06216#A3.E1) as $\left\lVert\nabla_{e_v}\mathcal{L}_s\right\rVert^2 \leq G^2\cdot X_s$. Taking expectations and summing over $\tau$ steps:

$$\mathbb{E}\left[\sum_{s=1}^{\tau}\left\lVert\nabla_{e_v}\mathcal{L}_s\right\rVert^2\right] \;\leq\; \sum_{s=1}^{\tau} G^2\cdot\mathbb{E}[X_s] \;\leq\; \tau\cdot f_v\cdot B\cdot T\cdot G^2, \tag{C.3}$$
which completes the derivation of equation [2.1](https://arxiv.org/html/2605.06216#S2.E1).

### C\.1Understanding the Ratio of Gradient for Rare and Common Tokens

Letvvbe a rare token withfv=ε≪1/\(B​T\)f\_\{v\}=\\varepsilon\\ll 1/\(BT\)anduua common token withfu≥c\>0f\_\{u\}\\geq c\>0\. Using the aforementioned derivation, we have

𝔼​\[∑s=1τ‖∇evℒs‖2\]≤τ⋅ε⋅B​T⋅G2\.\\mathbb\{E\}\\\!\\left\[\\sum\_\{s=1\}^\{\\tau\}\\\|\\nabla\_\{e\_\{v\}\}\\mathcal\{L\}\_\{s\}\\\|^\{2\}\\right\]\\;\\leq\\;\\tau\\cdot\\varepsilon\\cdot BT\\cdot G^\{2\}\.\(C\.4\)
To determine the lower bound for common frequency tokens, we assume a standard non\-degeneracy condition: whenever tokenuuappears in batchss, the per\-step squared gradient norm satisfies‖∇euℒs‖2≥Gmin2\>0\\\|\\nabla\_\{e\_\{u\}\}\\mathcal\{L\}\_\{s\}\\\|^\{2\}\\geq G\_\{\\min\}^\{2\}\>0on the event\{u∈batchs\}\\\{u\\in\\mathrm\{batch\}\_\{s\}\\\}\. This holds throughout training whenever the cross\-entropy loss has not been minimized on tokenuu\.

DefiningXs\(u\):=𝟏​\[u∈batchs\]X\_\{s\}^\{\(u\)\}:=\\mathbf\{1\}\[u\\in\\mathrm\{batch\}\_\{s\}\]and using‖∇euℒs‖2≥Gmin2⋅Xs\(u\)\\\|\\nabla\_\{e\_\{u\}\}\\mathcal\{L\}\_\{s\}\\\|^\{2\}\\geq G\_\{\\min\}^\{2\}\\cdot X\_\{s\}^\{\(u\)\}, we take expectations and sum overτ\\tausteps:

𝔼​\[∑s=1τ‖∇euℒs‖2\]≥τ⋅Pr⁡\[u∈batch\]⋅Gmin2=τ​\(1−\(1−fu\)B​T\)​Gmin2\.\\mathbb\{E\}\\\!\\left\[\\sum\_\{s=1\}^\{\\tau\}\\\|\\nabla\_\{e\_\{u\}\}\\mathcal\{L\}\_\{s\}\\\|^\{2\}\\right\]\\;\\geq\\;\\tau\\cdot\\Pr\[u\\in\\mathrm\{batch\}\]\\cdot G\_\{\\min\}^\{2\}\\;=\\;\\tau\\bigl\(1\-\(1\-f\_\{u\}\)^\{BT\}\\bigr\)G\_\{\\min\}^\{2\}\.\(C\.5\)Sincefu≥c\>0f\_\{u\}\\geq c\>0and1−\(1−x\)n1\-\(1\-x\)^\{n\}is non\-decreasing inxx:

1−\(1−fu\)B​T≥1−\(1−c\)B​T=:κ\>0,1\-\(1\-f\_\{u\}\)^\{BT\}\\;\\geq\\;1\-\(1\-c\)^\{BT\}\\;=:\\;\\kappa\\;\>\\;0,\(C\.6\)whereκ\\kappais a strictly positive constant depending only oncc,BB,TT\. Substituting it into equation[C\.5](https://arxiv.org/html/2605.06216#A3.E5)gives:

𝔼​\[∑s=1τ‖∇euℒs‖2\]≥τ​κ​Gmin2\.\\mathbb\{E\}\\\!\\left\[\\sum\_\{s=1\}^\{\\tau\}\\\|\\nabla\_\{e\_\{u\}\}\\mathcal\{L\}\_\{s\}\\\|^\{2\}\\right\]\\;\\geq\\;\\tau\\,\\kappa\\,G\_\{\\min\}^\{2\}\.\(C\.7\)
To determine the ratio between rare and common tokens, we divide equation [C.4](https://arxiv.org/html/2605.06216#A3.E4) by equation [C.7](https://arxiv.org/html/2605.06216#A3.E7):

$$\frac{\mathbb{E}\!\left[\sum_{s}\lVert\nabla_{e_v}\mathcal{L}_s\rVert^2\right]}{\mathbb{E}\!\left[\sum_{s}\lVert\nabla_{e_u}\mathcal{L}_s\rVert^2\right]}\;\leq\;\frac{\tau\,\varepsilon\,BT\,G^2}{\tau\,\kappa\,G_{\min}^2}\;=\;\frac{BT\cdot G^2}{\kappa\cdot G_{\min}^2}\cdot\varepsilon.\tag{C.8}$$

Here $BT$, $G^2$, and $G_{\min}^2$ are fixed positive constants, so the ratio is governed by $\varepsilon/\kappa$. By a first-order Taylor expansion for small $c$, we have:

$$\kappa\;=\;1-(1-c)^{BT}\;\approx\;c\cdot BT,\tag{C.9}$$

so the prefactor satisfies:

$$\frac{BT\cdot G^2}{\kappa\cdot G_{\min}^2}\;=\;\frac{BT\cdot G^2}{(c\cdot BT)\cdot G_{\min}^2}\;=\;\frac{G^2}{c\cdot G_{\min}^2}\;=\;O\!\left(\frac{1}{c}\right),\tag{C.10}$$

and the full bound on the ratio is $O(\varepsilon/c)$, which completes the proof of equation [2.3](https://arxiv.org/html/2605.06216#S2.E3).

##### Concrete evaluation on WikiText-103.

With $\varepsilon=8.3\times 10^{-9}$ (Bin-0 rare token, $n_v=1$), $c=8.3\times 10^{-3}$ (Bin-9 common token, $n_u\approx 10^6$), $B=8$, $T=2048$:

$$\kappa\;=\;1-(1-c)^{BT}\;\approx\;1-e^{-c\,BT}\;=\;1-e^{-136}\;\approx\;1,\tag{C.11}$$

$$\frac{\varepsilon}{c}\;=\;\frac{8.3\times 10^{-9}}{8.3\times 10^{-3}}\;=\;10^{-6}.\tag{C.12}$$

Under the conservative assumption $G^2/G_{\min}^2=10$, the gradient signal accumulated by a Bin-0 hapax embedding is bounded above by $10^{-5}$ times that of a Bin-9 common token over the same training run: a gradient disparity of five orders of magnitude.
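For readers who want to reproduce this arithmetic, the following snippet (our own illustrative check, using the constants quoted above) evaluates $\kappa$, $\varepsilon/c$, and the resulting disparity factor.

```python
import math

B, T = 8, 2048
eps = 8.3e-9          # Bin-0 hapax frequency (n_v = 1 on WikiText-103)
c = 8.3e-3            # Bin-9 common-token frequency (n_u ~ 1e6)
grad_ratio = 10.0     # conservative assumption G^2 / G_min^2

kappa = 1 - (1 - c) ** (B * T)        # equation (C.11): ~ 1 - e^{-136} ~ 1
disparity = (eps / c) * grad_ratio    # quoted bound: 1e-6 * 10 = 1e-5

print(f"kappa            = {kappa:.6f}")        # ~ 1.000000
print(f"eps / c          = {eps / c:.1e}")      # 1.0e-06
print(f"disparity factor = {disparity:.1e}")    # 1.0e-05
```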

## Appendix D Full Proof of Proposition [2.2](https://arxiv.org/html/2605.06216#S2.Thmtheorem2)

We prove that for any collapsed pair $(u,v)\in\mathcal{C}_\delta^{(\ell)}$ and any target function $g:\mathcal{V}\to\mathbb{R}^d$ with $\lVert g(u)-g(v)\rVert=C>L_{\mathrm{FFN}}\delta$, no setting of the FFN weights can approximate $g$ to error less than $(C-L_{\mathrm{FFN}}\delta)/2$ on both tokens simultaneously.

By definition of the contextual collapse set, $(u,v)\in\mathcal{C}_\delta^{(\ell)}$ implies $\lVert h_u-h_v\rVert\leq\delta$. Since the FFN is Lipschitz in its hidden-state input (virmaux2018lipschitz), we have

$$\lVert\mathrm{FFN}(h_u)-\mathrm{FFN}(h_v)\rVert\;\leq\;L_{\mathrm{FFN}}\,\delta.\tag{D.1}$$

The FFN outputs for $u$ and $v$ must therefore lie within a ball of radius $L_{\mathrm{FFN}}\delta$ of each other; they cannot be far apart, regardless of how the weights $W_1, W_2$ are chosen.

Using the triangle inequality, we can expand the target separation as:

$$\begin{aligned}
C&=\lVert g(u)-g(v)\rVert\\
&=\lVert g(u)-\mathrm{FFN}(h_u)+\mathrm{FFN}(h_u)-\mathrm{FFN}(h_v)+\mathrm{FFN}(h_v)-g(v)\rVert\\
&\leq\lVert g(u)-\mathrm{FFN}(h_u)\rVert+\lVert\mathrm{FFN}(h_u)-\mathrm{FFN}(h_v)\rVert+\lVert\mathrm{FFN}(h_v)-g(v)\rVert.
\end{aligned}\tag{D.2}$$
By substituting the Lipschitz bound of equation [D.1](https://arxiv.org/html/2605.06216#A4.E1) into equation [D.2](https://arxiv.org/html/2605.06216#A4.E2):

$$C\;\leq\;\lVert g(u)-\mathrm{FFN}(h_u)\rVert+L_{\mathrm{FFN}}\,\delta+\lVert g(v)-\mathrm{FFN}(h_v)\rVert.\tag{D.3}$$
Rearranging equation [D.3](https://arxiv.org/html/2605.06216#A4.E3) to isolate the error terms:

$$\lVert g(u)-\mathrm{FFN}(h_u)\rVert+\lVert g(v)-\mathrm{FFN}(h_v)\rVert\;\geq\;C-L_{\mathrm{FFN}}\,\delta.\tag{D.4}$$

The left-hand side is the sum of two non-negative approximation errors. When $C>L_{\mathrm{FFN}}\delta$, the right-hand side is strictly positive, so at least one of the two errors is positive. Specifically, since the maximum of two non-negative numbers is at least half their sum:

$$\max\bigl\{\lVert g(u)-\mathrm{FFN}(h_u)\rVert,\;\lVert g(v)-\mathrm{FFN}(h_v)\rVert\bigr\}\;\geq\;\frac{C-L_{\mathrm{FFN}}\,\delta}{2}.\tag{D.5}$$

Since $C>L_{\mathrm{FFN}}\,\delta$, the right-hand side is strictly positive, which completes the proof.
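The inequality chain can be checked numerically. The sketch below (our own toy example, not from the paper) builds a small two-layer ReLU FFN, bounds its Lipschitz constant by the product of the layer spectral norms, and verifies that the output separation for two $\delta$-close inputs never exceeds $L_{\mathrm{FFN}}\delta$, so the best achievable error on a well-separated target pair obeys equation (D.5).

```python
import torch

torch.manual_seed(0)
d, d_ff = 64, 256
W1 = torch.randn(d_ff, d) / d ** 0.5
W2 = torch.randn(d, d_ff) / d_ff ** 0.5

def ffn(h):
    # Two-layer ReLU feed-forward block (no biases, for simplicity).
    return W2 @ torch.relu(W1 @ h)

# Lipschitz upper bound: product of layer spectral norms (ReLU is 1-Lipschitz).
L_ffn = torch.linalg.matrix_norm(W1, ord=2) * torch.linalg.matrix_norm(W2, ord=2)

# Two collapsed hidden states at distance delta.
delta = 1e-3
direction = torch.randn(d)
h_u = torch.randn(d)
h_v = h_u + delta * direction / direction.norm()

out_gap = (ffn(h_u) - ffn(h_v)).norm()
assert out_gap <= L_ffn * delta + 1e-6          # equation (D.1)

# A target pair separated by C > L_ffn * delta: by equation (D.5), no weight
# setting can drive both approximation errors below (C - L_ffn * delta) / 2.
C = 5.0
g_u = torch.zeros(d); g_u[0] = C / 2
g_v = torch.zeros(d); g_v[0] = -C / 2
err_u = (g_u - ffn(h_u)).norm()
err_v = (g_v - ffn(h_v)).norm()
print(f"L_ffn * delta           = {(L_ffn * delta).item():.4f}")
print(f"max approximation error = {max(err_u, err_v).item():.4f} "
      f">= lower bound {((C - L_ffn * delta) / 2).item():.4f}")
assert max(err_u, err_v) >= (C - L_ffn * delta) / 2
```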

Three quantities govern the bound, none of which is under the FFN’s control:

- •Target separation ($C$): The required difference $\lVert g(u)-g(v)\rVert$ between the optimal representations of tokens $u$ and $v$. This is determined by what the downstream task needs; for example, the grammatical-class distance between *“their”* (possessive determiner) and *“there”* (locative adverb) is fixed by the language, not by the model’s architecture.
- •Input proximity ($\delta$): The distance $\lVert h_u-h_v\rVert$ between the hidden states that the attention layer produces for $u$ and $v$. When two tokens appear in nearly identical contexts, with the same surrounding words and the same syntactic position, attention cannot distinguish them and $\delta$ is small. The FFN receives $h_u$ and $h_v$ as inputs; it cannot *choose* to receive different inputs.
- •FFN change limit ($L_{\mathrm{FFN}}$): The Lipschitz constant controls how rapidly the FFN output can change per unit change in input. While $L_{\mathrm{FFN}}$ depends on the weights and could in principle be made large, doing so causes exploding gradients and training instability. A large $L_{\mathrm{FFN}}$ amplifies the FFN’s response to *every* input perturbation, not only to the gap between $h_u$ and $h_v$; this sharply degrades performance on the majority of tokens whose hidden states are not collapsed.

## Appendix E Full Proof of Proposition [3.1](https://arxiv.org/html/2605.06216#S3.Thmtheorem1)

We prove that for any $\epsilon>0$ there exist finite router parameters $W_r^\ell$ such that $\lVert m^\ell(v)\rVert<\epsilon$ for all $v\in\mathcal{V}$ and $\ell\in\{1,\ldots,L\}$.

Given $M_{K+1}(v)=\mathbf{0}$ by definition, for any router weight $\alpha_{K+1}^\ell\in(0,1)$:

$$\alpha_{K+1}^\ell\cdot M_{K+1}(v)=\alpha_{K+1}^\ell\cdot\mathbf{0}=\mathbf{0}.$$

This ensures that the null bank contributes nothing to the TIDE output. The memory term therefore simplifies to:

$$m^\ell(v)=\sum_{k=1}^{K+1}\alpha_k^\ell\,M_k(v)=\sum_{k=1}^{K}\alpha_k^\ell\,M_k(v).$$
By the softmax constraint of equation [3.5](https://arxiv.org/html/2605.06216#S3.E5), $\sum_{k=1}^{K}\alpha_k^\ell=1-\alpha_{K+1}^\ell$. Applying the triangle inequality, we can bound the memory norm as:

$$\lVert m^\ell(v)\rVert\;\leq\;\sum_{k=1}^{K}\alpha_k^\ell\,\lVert M_k(v)\rVert\;\leq\;\bigl(1-\alpha_{K+1}^\ell\bigr)\cdot C,\tag{E.1}$$

where $C=\max_{v\in\mathcal{V},\,k\leq K}\lVert M_k(v)\rVert<\infty$.

Next, we express the active weight sum in terms of the null logit. Set $z_{K+1}^\ell=s>0$ and $z_k^\ell=0$ for all $k\leq K$. The softmax then gives:

$$1-\alpha_{K+1}^\ell=\frac{K}{K+e^{s}}.\tag{E.2}$$

As $s\to\infty$, $K/(K+e^{s})\to 0$, so the total active bank weight vanishes.

Substituting equation [E.2](https://arxiv.org/html/2605.06216#A5.E2) into equation [E.1](https://arxiv.org/html/2605.06216#A5.E1) gives $\lVert m^\ell(v)\rVert\leq KC/(K+e^{s})$. For any $\epsilon\in(0,C)$, solving $KC/(K+e^{s})=\epsilon$ yields:

$$s^{*}=\log\!\left(\frac{K(C-\epsilon)}{\epsilon}\right).\tag{E.3}$$

Now we have $KC/(K+e^{s^{*}})=KC\epsilon/KC=\epsilon$. Therefore $\lVert m^\ell(v)\rVert\leq\epsilon$ uniformly over all $v$ and $\ell$ for any $s\geq s^{*}$.

Note that the threshold $s^{*}$ is finite for any $\epsilon\in(0,C)$. Setting $W_r^\ell$ so that $W_r^\ell\tilde{n}^\ell\approx s^{*}\,\mathbf{e}_{K+1}$ is a finite parameter assignment under which $\lVert m^\ell(v)\rVert<\epsilon$ for all $v$ and $\ell$.
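As a quick numerical illustration of equations (E.2)-(E.3) (a sketch we add for clarity; $K$, $C$, and $\epsilon$ below are arbitrary example values), one can verify that raising only the null logit to $s^{*}$ drives the routed memory norm below any chosen $\epsilon$:

```python
import math

K, C, eps = 24, 5.0, 1e-3           # example values: banks, max block norm, target bound

def active_weight(s, K):
    """Total softmax mass on the K active banks when the null logit is s
    and all active logits are 0 (equation E.2)."""
    return K / (K + math.exp(s))

s_star = math.log(K * (C - eps) / eps)          # equation (E.3)
bound = active_weight(s_star, K) * C            # (1 - alpha_null) * C from equation (E.1)

print(f"s* = {s_star:.3f}")
print(f"memory-norm bound at s* = {bound:.6f} (target eps = {eps})")
assert bound <= eps + 1e-12
```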

##### Additional Remark:

From equation [E.2](https://arxiv.org/html/2605.06216#A5.E2), $\sum_{k=1}^{K}\alpha_k^\ell=K/(K+e^{z_{K+1}^\ell})$ depends *only* on the null logit $z_{K+1}^\ell$. A single large null logit jointly suppresses all $K$ active banks through softmax competition, reducing the suppression degree of freedom to one scalar, regardless of $K$.

## Appendix F Full Proof of Proposition [3.2](https://arxiv.org/html/2605.06216#S3.Thmtheorem2): K-Pathway Gradient Amplification

For simplicity, Proposition [3.2](https://arxiv.org/html/2605.06216#S3.Thmtheorem2) is stated for a simplified router over $K$ active banks that excludes the null bank at slot $K+1$.

Fix a token $v\in\mathcal{V}$ and let $e_v^{(k)}$ denote the row of embedding table $E_k$ corresponding to $v$, for $k=1,\ldots,K$. Define $X_s:=\mathbf{1}[v\in\mathrm{batch}_s]$. As established in Appendix [C](https://arxiv.org/html/2605.06216#A3) with the Bernoulli bound, we have:

$$\mathbb{E}[X_s]=1-(1-f_v)^{BT}=:\kappa_v\;\leq\;f_v\cdot BT.\tag{F.1}$$

Given that the $K$ MemoryBlocks have no shared parameters, the gradient with respect to $e_v^{(k)}$ is identically zero whenever $v\notin\mathrm{batch}_s$, and otherwise:

$$\nabla_{e_v^{(k)}}\mathcal{L}_s=\sum_{\ell=1}^{L}\alpha_k^\ell\cdot\frac{\partial\mathcal{L}_s}{\partial m^\ell(v)}\cdot\frac{\partial M_k(v)}{\partial e_v^{(k)}},\tag{F.2}$$

where $\partial\mathcal{L}_s/\partial m^\ell(v)$ is the upstream gradient from layer $\ell$'s residual stream. Since $M_k(v)$ enters every layer, each block accumulates gradient contributions across all $L$ layers.

During training, whenever $v\in\mathrm{batch}_s$ and the loss has not been minimized on token $v$, we assume the standard non-degeneracy condition: for each $k$, there exists at least one layer $\ell^{*}$ such that

$$\lVert\nabla_{e_v^{(k)}}\mathcal{L}_s\rVert^2\geq G_{\min}^2>0\quad\text{on the event }\{v\in\mathrm{batch}_s\}.\tag{F.3}$$
Since the $K$ blocks are independent and each satisfies equation [F.3](https://arxiv.org/html/2605.06216#A6.E3), and $\nabla_{e_v^{(k)}}\mathcal{L}_s=\mathbf{0}$ exactly when $X_s=0$ (token $v$ absent from the batch), we have $\lVert\nabla_{e_v^{(k)}}\mathcal{L}_s\rVert^2\geq G_{\min}^2\cdot X_s$. Summing across the $K$ blocks:

$$\sum_{k=1}^{K}\lVert\nabla_{e_v^{(k)}}\mathcal{L}_s\rVert^2\;\geq\;K\cdot G_{\min}^2\cdot X_s.\tag{F.4}$$

Taking expectations and summing over $\tau$ steps completes the proof of equation [3.7](https://arxiv.org/html/2605.06216#S3.E7):

$$\mathbb{E}\!\left[\sum_{s=1}^{\tau}\sum_{k=1}^{K}\lVert\nabla_{e_v^{(k)}}\mathcal{L}_s\rVert^2\right]\;\geq\;K\cdot\tau\cdot\kappa_v\cdot G_{\min}^2.\tag{F.5}$$
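To see the $K$-fold effect concretely, the toy simulation below (our own illustration; the gradient magnitude is a stand-in, not a measurement) counts batch inclusions of a rare token over $\tau$ steps and compares the single-table accumulation against the $K$-pathway sum of equation (F.4).

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, tau, K = 8, 2048, 5000, 8
f_v = 1e-5                       # rare-token frequency (example value)
g_min_sq = 1.0                   # per-step squared gradient norm when v is present (stand-in)

# X_s = 1 iff v appears in batch s; one Bernoulli trial per training step.
p_in_batch = 1 - (1 - f_v) ** (B * T)
X = rng.random(tau) < p_in_batch

single_table = g_min_sq * X.sum()          # standard transformer: one embedding row
tide_k_paths = K * g_min_sq * X.sum()      # TIDE: K independent MemoryBlock rows (equation F.4)

print(f"Pr[v in batch]            = {p_in_batch:.3f}")
print(f"single-table accumulation = {single_table:.0f}")
print(f"K-pathway accumulation    = {tide_k_paths:.0f}  ({K}x larger)")
```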
Additional Remark: The standard transformer upper bound gives $\mathbb{E}\!\left[\sum_{s}\lVert\nabla_{e_v}\mathcal{L}_s\rVert^2\right]\leq\tau\cdot f_v\cdot BT\cdot G^2$ from Appendix [C](https://arxiv.org/html/2605.06216#A3). The TIDE lower bound is $K$ times the analogous single-block lower bound, confirming a $K$-fold amplification under the assumption $G^2/G_{\min}^2=O(1)$.

## Appendix G Additional Details for Proposition [3.3](https://arxiv.org/html/2605.06216#S3.Thmtheorem3): Contextual Collapse and TIDE's K MemoryBlocks

Proposition [3.3](https://arxiv.org/html/2605.06216#S3.Thmtheorem3) states that for any collapsed pair $(u,v)\in\mathcal{C}_\delta^{(\ell)}$ with $\lVert h_u^{(\ell)}-h_v^{(\ell)}\rVert\leq\delta$, and any target separation $C>0$, there exist EmbeddingMemory parameters $\{E_k\}_{k=1}^{K}$ such that $\lVert M_k(u)-M_k(v)\rVert=C$ for any $K\geq 1$.

From equation [3.3](https://arxiv.org/html/2605.06216#S3.E3), we have $M_k(v)=\mathrm{RMSNorm}(E_k[v])$, where $E_k[v]$ is the row of $E_k\in\mathbb{R}^{|\mathcal{V}|\times d_b}$ selected by the discrete index $v$.

The hidden state $h_v^{(\ell)}$ does not appear in equation [3.3](https://arxiv.org/html/2605.06216#S3.E3), so $M_k(u)$ and $M_k(v)$ are independent of $\delta=\lVert h_u^{(\ell)}-h_v^{(\ell)}\rVert$. This stands in direct contrast to the FFN, for which Lipschitz continuity forces $\lVert\mathrm{FFN}(h_u)-\mathrm{FFN}(h_v)\rVert\leq L_{\mathrm{FFN}}\delta$ regardless of the weights chosen, bounding the output separation from above by $L_{\mathrm{FFN}}\delta$.

Since the rows $E_k[u]$ and $E_k[v]$ are uncoupled parameters, the table $E_k$ assigns one dedicated row per vocabulary entry with no joint constraint, so assigning one row places no restriction on the other, regardless of $\delta$. Consequently, $M_k(u)$ and $M_k(v)$ can be set independently to any two vectors in the output range of $\mathrm{RMSNorm}$. In particular, for any target $C>0$ there trivially exist row assignments such that $\lVert M_k(u)-M_k(v)\rVert=C$, regardless of $\delta=\lVert h_u^{(\ell)}-h_v^{(\ell)}\rVert$ and independently of $L_{\mathrm{FFN}}$.

![Refer to caption](https://arxiv.org/html/2605.06216v1/x12.png)

Figure 11: Marginal contribution of each layer's memory injection in TIDE. The routed EmbeddingMemory output at one layer $\ell$ is zeroed while all other layers retain their full pathway; we report perplexity on WikiText-2, DCLM, and PubMed. The relatively higher degradation on PubMed also aligns with our rare-token problem.
## Appendix H Understanding the Layer-wise Contribution of EmbeddingMemory

Since TIDE adds EmbeddingMemory to the residual stream at every transformer layer, a natural question arises: *Is each layer's memory injection equally important, or does it contribute disproportionately at certain depths of the network?* To probe this, we sweep the layer index $\ell\in\{0,1,\ldots,L-1\}$ of our TIDE-1B model and, at each sweep point, replace the routed memory contribution at layer $\ell$ alone with the zero vector while leaving the router and MemoryBlock pathways at every other layer untouched. Concretely, the standard TIDE forward pass at layer $\ell$, given by equation [3.6](https://arxiv.org/html/2605.06216#S3.E6) as:

$$h^\ell=\tilde{h}^\ell+\operatorname{FFN}\!\bigl(\tilde{n}^\ell\bigr)+m^\ell(v),$$

becomes $h^\ell=\tilde{h}^\ell+\operatorname{FFN}\!\bigl(\tilde{n}^\ell\bigr)$ for the single ablated layer $\ell$, where $\tilde{h}^\ell=\mathrm{RMSNorm}\bigl(h^\ell+\mathrm{Attn}^\ell(h^\ell)\bigr)$ and $m^\ell(v)$ denotes the routed sum over the $K=24$ MemoryBlocks. We then evaluate perplexity on WikiText-2, DCLM, and PubMed, repeating for every $\ell$ to obtain a per-layer marginal-contribution profile.
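A minimal sketch of this per-layer ablation loop is shown below. The module path `model.layers[l].embedding_memory` and the `eval_fn` helper are assumptions for illustration, since the TIDE code is not described at that level of detail; the sketch only conveys the mechanics of zeroing one layer's routed memory via a forward hook.

```python
import torch

@torch.no_grad()
def layerwise_memory_ablation(model, eval_fn, num_layers):
    """Zero the routed EmbeddingMemory output at one layer at a time and
    record the resulting perplexity (hypothetical module/helper names)."""
    results = {}
    for l in range(num_layers):
        # Forward hook that replaces the memory pathway's output m^l(v) with zeros.
        handle = model.layers[l].embedding_memory.register_forward_hook(
            lambda module, inputs, output: torch.zeros_like(output)
        )
        results[l] = eval_fn(model)   # e.g. perplexity on WikiText-2 / DCLM / PubMed
        handle.remove()               # restore the full pathway before the next sweep point
    return results

# Usage sketch: ppl_by_layer = layerwise_memory_ablation(tide_model, evaluate_perplexity, num_layers=24)
```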

Memory contribution is intermittent, not monotone: Figure [11](https://arxiv.org/html/2605.06216#A7.F11) reveals that the contribution of EmbeddingMemory is not uniform across the depth of the trained TIDE checkpoint. Dropping layer 0 collapses the model entirely: perplexity rises by more than $10^{3}\%$ on every dataset and reaches $1.09\times 10^{6}\%$ on PubMed, indicating that the first memory injection is irreplaceable for the model. Layer 1 remains substantially load-bearing ($+8.1\%$ to $+12.9\%$ across datasets), suggesting a brief consolidation period during which token-identity information is propagated into the residual stream. After this, degradation falls sharply: dropping any single layer in the contiguous range $\ell\in[4,12]$ costs less than $\sim$2% PPL.

We additionally observe a clear secondary peak at layer 13, where dropping the layer causes a new spike in performance degradation. We interpret this pattern as evidence that token-identity information injected by early EmbeddingMemory layers *persists* in the residual stream for several intermediate layers, during which any single memory contribution becomes redundant. Once this token-identity signal is consumed by the ongoing contextual computation, memory information is again required to refresh it at intermittent intervals.

## Appendix I Understanding Decoding Cost with MemoryBlocks

In this section, we investigate how our proposed TIDE-1B architecture performs relative to LLaMa-Base-1B in terms of decoding speed (ms/token). All numbers are measured on a single B200 GPU and averaged across 5,120 generated tokens.

Table 4: Token decoding speed for TIDE-1B variants in comparison to the LLaMa-Base-1B transformer model.

| Model | Decoding Speed (ms/token) |
| --- | --- |
| LLaMa-Base-1B | 11.085 |
| TIDE-2E-1B | 11.236 |
| TIDE-4E-1B | 11.854 |
| TIDE-8E-1B | 12.688 |
| TIDE-16E-1B | 12.901 |
| TIDE-24E-1B | 13.422 |
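For reference, a measurement of this kind can be set up as below. This is a generic sketch, not the authors' harness; the checkpoint path is a placeholder, and the loop simply divides wall-clock generation time by the number of new tokens.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/checkpoint"   # placeholder: the TIDE checkpoints are not public
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda().eval()

prompt = tok("The capital of France is", return_tensors="pt").to("cuda")
new_tokens = 5120

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    # Force exactly `new_tokens` greedy decoding steps so the average is well defined.
    model.generate(**prompt, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"decoding speed: {1000 * elapsed / new_tokens:.3f} ms/token")
```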

## Appendix J Investigating Compression Strategies for EmbeddingMemory

Our proposed TIDE architectures provide an opportunity to offload each static MemoryBlock within the EmbeddingMemory to storage devices in resource-constrained settings, with asynchronous pre-fetching. This leads to the question: *How does the trade-off between VRAM and storage devices look, and what can be done to minimize MemoryBlock storage cost?*

We first estimate the total storage cost of the MemoryBlocks as follows:

$$\text{Storage Size}=\mathrm{vocab\_size}\times\mathrm{num\_blocks}\times\mathrm{hidden\_dim}\times\mathrm{bits\_per\_param}.\tag{J.1}$$

For our TIDE-8E-1B model with 24 layers and a hidden dimension of 2048, trained with the LLaMa-3.1 tokenizer and its 128,256-entry vocabulary, $\sim$4.2 GB of storage is required for EmbeddingMemory at FP16 precision. To address our question, we performed a preliminary investigation of the storage challenges of EmbeddingMemory from two different perspectives (an effective novel compression technique for EmbeddingMemory is out of the scope of this work; our preliminary investigation reveals high redundancy within the MemoryBlocks and leaves more sophisticated studies that capitalize on these redundancies as future work):

- D1. Quantization of EmbeddingMemory, and
- D2. Low-rank compression of individual MemoryBlocks.
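As a sanity check of equation (J.1), the snippet below (an illustration we add; the configuration values come from the TIDE-8E-1B description above) evaluates the storage footprint at 16-, 8-, and 4-bit precision, matching the sizes reported in Table 5.

```python
vocab_size = 128_256      # LLaMa-3.1 tokenizer
num_blocks = 8            # TIDE-8E-1B
hidden_dim = 2048

def embedding_memory_gb(bits_per_param: int) -> float:
    """Equation (J.1), converted from bits to gigabytes."""
    total_bits = vocab_size * num_blocks * hidden_dim * bits_per_param
    return total_bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit EmbeddingMemory: {embedding_memory_gb(bits):.2f} GB")
# 16-bit: 4.20 GB, 8-bit: 2.10 GB, 4-bit: 1.05 GB
```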

### J.1 Quantization of EmbeddingMemory

Table 5: Performance comparison of TIDE-8E-1B with various low-precision MemoryBlocks.

| Precision | Size (GB) | WikiText-2 ($\downarrow$) | PubMed ($\downarrow$) | DCLM ($\downarrow$) |
| --- | --- | --- | --- | --- |
| 16-bit | 4.20 | 10.088 | 11.100 | 16.108 |
| 8-bit | 2.10 | 10.089 | 11.100 | 16.113 |
| 4-bit | 1.05 | 10.263 | 11.277 | 16.343 |

### J.2 Low-Rank Compression of Token-wise MemoryBlocks

Several recent works (li2023losparse; wang2023cuttlefish; kaushal2023lord) have explored the low-rank characteristics of weights and gradients to address the storage demands and computational complexity linked to the large matrices of LLMs. For our TIDE-8E-1B model checkpoint with 8 MemoryBlocks, each block holds an embedding table $M_k\in\mathbb{R}^{|\mathcal{V}|\times d_b}$ that maps a token index $v\in\mathcal{V}$ to a $d_b$-dimensional vector. A rank-$r$ SVD decomposition of $M_k$ yields two matrices $U\in\mathbb{R}^{|\mathcal{V}|\times r}$ and $V\in\mathbb{R}^{r\times d_b}$; rather than storing $M_k$ directly, we can store the factored representation $(U,V)$ provided $r$ is small enough that the factored form has fewer parameters. We estimate the rank $r$ below which storing $(U,V)$ saves space as follows:

$$(|\mathcal{V}|\cdot r)+(r\cdot d_b)\;\leq\;|\mathcal{V}|\cdot d_b\quad\Longrightarrow\quad r\;\leq\;\frac{|\mathcal{V}|\cdot d_b}{|\mathcal{V}|+d_b}.\tag{J.2}$$
For TIDE-8E-1B with hidden dimension $d=2048$, bottleneck dimension $d_b=2048$, and the LLaMA-3.1 tokenizer with vocabulary size $|\mathcal{V}|=128{,}256$, equation [J.2](https://arxiv.org/html/2605.06216#A10.E2) gives $r\leq 2015$, so any uniform rank reduction of at least $\sim$2% across the eight MemoryBlocks is sufficient to reduce storage relative to the dense parameterization.
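A minimal sketch of this uniform truncation is given below, assuming each block's table is available as a dense tensor (the random matrix stands in for a trained $M_k$); it computes the break-even rank from equation (J.2) and the parameter count of the factored form.

```python
import torch

vocab, d_b = 128_256, 2048
M_k = torch.randn(vocab, d_b)            # stand-in for one trained MemoryBlock table

# Break-even rank from equation (J.2): below this, (U, V) is smaller than M_k.
r_max = (vocab * d_b) // (vocab + d_b)   # = 2015 for these dimensions

p = 0.5                                  # 50% uniform rank reduction (r = 1024)
r = int((1 - p) * d_b)

# Truncated SVD factorization: M_k ~ U @ V with U: (vocab, r), V: (r, d_b).
U_full, S, Vh = torch.linalg.svd(M_k, full_matrices=False)
U = U_full[:, :r] * S[:r]                # fold singular values into the left factor
V = Vh[:r, :]

dense_params = vocab * d_b
factored_params = U.numel() + V.numel()
print(f"break-even rank r_max = {r_max}")
print(f"rank {r}: {factored_params / dense_params:.1%} of dense parameters")
```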

![Refer to caption](https://arxiv.org/html/2605.06216v1/x13.png)

Figure 12: Uniform rank reduction across all 8 MemoryBlocks of TIDE-8E-1B. Each $M_k\in\mathbb{R}^{|\mathcal{V}|\times d_b}$ is replaced by its rank-$r$ SVD approximation with $r=\lceil(1-p)\cdot d_b\rceil$ applied identically to every block. (a) Absolute perplexity on WikiText-2, DCLM, and PubMed; dotted horizontal lines mark each dataset's uncompressed baseline. (b) Relative perplexity degradation, which is largely flat through $\sim$30% reduction, degrades gradually through $\sim$60%, and rises sharply beyond 70%.

Figure [12](https://arxiv.org/html/2605.06216#A10.F12) sweeps the uniform reduction percentage from 0% to 90% in 10% increments and reports perplexity on WikiText-2, DCLM, and PubMed. At modest reductions (10–30%, $r\in[1434,1844]$), perplexity degradation remains almost flat while the parameter count per MemoryBlock drops to as little as 71% of the dense form. At moderate reductions (40–60%, $r\in[820,1229]$), degradation grows to around $\sim$10% to $\sim$25% but remains gradual. Beyond 70% reduction, however, the curves bend sharply upward and the relative $\Delta$PPL reaches 587% on WikiText-2, 484% on DCLM, and 657% on PubMed. Our findings indicate that TIDE MemoryBlocks can be compressed significantly, up to a $\sim$50% uniform rank reduction ($r=1024$, halving the per-block parameter count), with only a marginal drop in performance, which is practical under limited resource availability. Given the existence of non-uniform low-rank structure across different layers (jaiswal2024galore), we believe that EmbeddingMemory can be further compressed using non-uniform rank reduction techniques for superior performance relative to uniform SVD.

## Appendix K K-Nearest Neighbor Study of the Base Embedding and MemoryBlocks

To understand the Base and MemoryBlock embeddings from a semantic perspective, we ask an interesting question: *At the individual token level, do the MemoryBlocks recover semantic neighbors that the primary embedding $E$ has failed to learn during training?* To probe this, we compute the per-token Jaccard overlap $J_k$ between the top-10 cosine neighbors under $E$ and each $M_k$, across 200 randomly sampled rare and common tokens.
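A sketch of this measurement is below; `E` and `memory_tables` are placeholders for the primary embedding matrix and the $K$ MemoryBlock tables of a trained checkpoint (random tensors here, so the snippet runs standalone).

```python
import torch

def top10_neighbors(table: torch.Tensor, token_id: int) -> set:
    """Indices of the 10 most cosine-similar rows to `token_id`, excluding itself."""
    normed = torch.nn.functional.normalize(table, dim=-1)
    sims = normed @ normed[token_id]
    sims[token_id] = -float("inf")                 # drop the query token itself
    return set(sims.topk(10).indices.tolist())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Stand-ins for a trained checkpoint: primary embedding E and K MemoryBlock tables.
vocab, d, K = 1000, 64, 8
E = torch.randn(vocab, d)
memory_tables = [torch.randn(vocab, d) for _ in range(K)]

token_id = 42                                      # one sampled token
base_neighbors = top10_neighbors(E, token_id)
for k, M_k in enumerate(memory_tables, start=1):
    J_k = jaccard(base_neighbors, top10_neighbors(M_k, token_id))
    print(f"J_{k} = {J_k:.2f}")
```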

In Figure [13](https://arxiv.org/html/2605.06216#A11.F13), we present the distribution of $J_k$ over tokens and find that the rare-token boxes lie consistently below the common-token boxes for every block $M_k$. This indicates that for common tokens the neighbor sets agree closely with $E$, whereas for rare tokens the neighbor sets are *substantially disjoint*, adding complementary, non-overlapping information about them. Table [6](https://arxiv.org/html/2605.06216#A11.T6) gives concrete examples: for the rare token *asynchronously*, $E$'s top-10 are dominated by adverbs ending in {-ly}, but each MemoryBlock contributes additional closely related technical information (*e.g.* $M_2$ surfaces *Asynchronous JavaScript and XML*, *callbacks*; $M_3$ surfaces *defensively*, *securely*), enriching the semantic structure.

The rare-name query *fred* shows the same pattern: $E$ returns only first-name neighbors, while individual MemoryBlocks additionally recover orthographic variants (*Fred*, *Frederick*, *Freddy*), tokenizer fragments (*Freder*), and cross-lingual variants (*Hans*, *Viktor*). The complementary-information picture is therefore not a global statistical artifact but reflects genuine per-token specialization of the MemoryBlocks.

![Refer to caption](https://arxiv.org/html/2605.06216v1/x14.png)

Figure 13: Cosine-nearest agreement between the primary embedding $E$ and memory blocks $M_k$ for *rare* and *common* tokens. Rare-token boxes lie consistently below common-token boxes, indicating that the memory-block pathways encode neighbor sets that are *substantially disjoint* from $E$ for rare tokens and add complementary new information to the model.

Table 6: Top-10 cosine-nearest neighbors of two rare query tokens in the primary embedding table $E$ and across the 8 MemoryBlocks in TIDE-8E-1B ($M_1,M_2,\ldots,M_8$) model checkpoints. Row backgrounds in the original figure encode per-block Jaccard rank against the Base top-10 within each query, with *darker* shades indicating *higher* $J_k$ (more neighbor-set agreement with Base).

| Query | Pathway | Top-10 nearest neighbors (cosine) | $J_k$ |
| --- | --- | --- | --- |
| asynchronously | Base $E$ | asynchronous, ynchronously, sequentially, recursively, dynamically, ynchronous, concurrently, independently, horizontally, seamlessly | – |
| | $M_1$ | asynchronous, ynchronously, synchronous, ynchronous, silently, globally, callback, concurrently, digitally, unsafe | 0.25 |
| | $M_2$ | asynchronous, ynchronously, Asynchronous JavaScript and XML, callbacks, synchronous, ynchronous, LSD, hashtags, conditionally, breathable | 0.18 |
| | $M_3$ | ynchronously, asynchronous, recursively, ynchronous, sequentially, defensively, resonate, dynamically, callbacks, securely | 0.43 |
| | $M_4$ | asynchronous, ynchronously, concurrently, synchronous, ynchronous, dynamically, anonymously, externally, parallel, simultaneously | 0.33 |
| | $M_5$ | asynchronous, ynchronously, simultaneously, recursively, spontaneously, sequentially, efficiently, tirelessly, separately, independently | 0.33 |
| | $M_6$ | asynchronous, synchronous, Asynchronous JavaScript and XML, instantiated, serialize, scalable, serialized, ynchronously, caching | 0.11 |
| | $M_7$ | asynchronous, ynchronously, manually, Premiere, optionally, reordered, indefinitely, RMS, factorial, charcoal | 0.11 |
| | $M_8$ | asynchronous, synchronous, ynchronously, ynchronous, dynamically, synchronized, concurrently, electronically, simultaneously, automatically | 0.33 |
| fred | Base $E$ | Fred, Fred, Larry, Roger, Doug, Charlie, Sean, jim, Mike, Jake | – |
| | $M_1$ | Fred, Fred, joe, john, Ginny, Frederick, Doug, Lena, Woody, zar | 0.18 |
| | $M_2$ | Fred, Fred, alf, Freder, Maggie, Carlo, Viktor, alan, Noel, Amit | 0.11 |
| | $M_3$ | Fred, Fred, fred, mary, bob, Bob, Frederick, Bob, Freder, Herbert | 0.11 |
| | $M_4$ | Fred, Fred, martin, Martin, Zack, Bernie, Frederick, alex, Charlie, Albert | 0.18 |
| | $M_5$ | Fred, Fred, christ, Nora, brahim, Dani, tek, Enterprise, Yo, Practice | 0.11 |
| | $M_6$ | Fred, Fred, Katy, Hans, Ogre, Nel, Sing, Ian, Berk, Toby | 0.11 |
| | $M_7$ | Fred, Fred, fred, Ron, Doug, COST, Todd, Evelyn, Lindsey, Apache | 0.11 |
| | $M_8$ | Fred, Fred, fred, Frederick, Freddy, Ned, Freddie, Geoff, Fried, red | 0.11 |

## Appendix L Model Training Implementation Details

![Refer to caption](https://arxiv.org/html/2605.06216v1/x15.png)

Figure 14: Training loss comparison of LLaMa-1B and TIDE-1B with 2, 4, 8, 16, and 24 MemoryBlocks.

Table 7: Model training configurations for our Base and TIDE models. All model checkpoints trained in this paper adopt exactly the same configuration for fair comparison.

| Category | Key | Value |
| --- | --- | --- |
| Common | Tokens Count | 400-500 Billion |
| | Vocabulary Size | 128,256 |
| | Tokenizer | meta-llama/Llama-3.1-8B |
| | Dataset | mlfoundations/dclm-baseline-1.0 |
| | Sequence Length | 2048 |
| | Hidden Activation | SiLU |
| Loss | Name | Cross Entropy |
| | Z-loss | 1.0e-6 |
| Optimizer | Name | Adam |
| | Weight Decay | 0.1 |
| | Beta1 | 0.9 |
| | Beta2 | 0.95 |
| Scheduler | Warmup Initial LR | 1e-06 |
| | Warmup Iterations | 10000 |
| | Type | Cosine |
| | Max LR | 1.0e-04 |
| | Min LR | 1.0e-05 |
## Appendix M Limitations and Future Work

While TIDE delivers consistent gains across model scales and downstream tasks, we would like to acknowledge some limitations. ❶ Storage overhead: Although the EmbeddingMemory tables are static and quantization-friendly, the SSD footprint still scales linearly with $K$. Deployments with strict storage budgets need to rely on the compression strategies discussed in Appendix [J](https://arxiv.org/html/2605.06216#A10), along with conventional techniques (jaiswal2023emergence; jaiswal2024ffn; li2023losparse; yin2023outlier), to reduce SSD overhead. ❷ Training and evaluation scope: Our experiments cover model scales from 750M to 3B parameters trained on 200–500B tokens of DCLM, with evaluations on WikiText, PubMed, DCLM, and eight zero-shot benchmarks. The benefits of TIDE remain unexplored for longer training horizons and after instruction tuning or RLHF, and are left to future work. ❸ Interpretability: Per-block router statistics and nearest-neighbor analyses (Table [6](https://arxiv.org/html/2605.06216#A11.T6)) suggest that distinct MemoryBlocks specialize to distinct frequency regimes, but we do not provide a principled account of *what* each block learns. A more fine-grained interpretability study of MemoryBlock specialization is an important direction for future work.
