Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation

arXiv cs.LG Papers

Summary

This paper proposes RankElastor, a novel architecture that mitigates embedding collapse in dense scaling of recommendation models by introducing parameterized full mixing and GLU-improved P-FFNs, achieving robust scaling and improved performance on large-scale datasets.

arXiv:2605.23191v1 Announce Type: new Abstract: Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from \textit{embedding collapse}, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a \textbf{damped oscillatory trajectory} in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) \textbf{parameterized full mixing}, which enables expressive token mixing with improved spectral robustness; and (ii) \textbf{GLU-improved P-FFNs}, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: https://github.com/vasile-paskardlgm/RankElastor

Cached at: 05/25/26, 09:02 AM

# Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation
Source: [https://arxiv.org/html/2605.23191](https://arxiv.org/html/2605.23191)
\(2026\)

###### Abstract\.

Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from *embedding collapse*, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a **damped oscillatory trajectory** in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) **parameterized full mixing**, which enables expressive token mixing with improved spectral robustness; and (ii) **GLU-improved P-FFNs**, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: [https://github.com/vasile-paskardlgm/RankElastor](https://github.com/vasile-paskardlgm/RankElastor)

Recommender Systems, CTR prediction, Dimensional Collapse

Journal year: 2026. Copyright: CC. Conference: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’26), August 09–13, 2026, Jeju Island, Republic of Korea. DOI: 10.1145/3770855.3818049. ISBN: 979-8-4007-2259-2/2026/08. CCS concepts: Computing methodologies → Machine learning.

## 1\.Introduction

Recommender systems are a core application of machine learning, aiming to predict user–item interactions from massive multi\-field categorical data\(Zhang et al\.,[2016](https://arxiv.org/html/2605.23191#bib.bib43)\)\. They have become indispensable in modern digital platforms, powering applications such as e\-commerce, social media, and content recommendation\. Recent advances in deep learning–based recommenders enable flexible feature representation learning and complex feature interaction modeling, leading to strong performance in large\-scale industrial deployments\.

Inspired by the success of large foundation models (Radford et al., [2021](https://arxiv.org/html/2605.23191#bib.bib26); Achiam et al., [2023](https://arxiv.org/html/2605.23191#bib.bib2); Rombach et al., [2022](https://arxiv.org/html/2605.23191#bib.bib28); Kirillov et al., [2023](https://arxiv.org/html/2605.23191#bib.bib20)), scaling model capacity has emerged as a natural direction for recommender systems (Zhang et al., [2024b](https://arxiv.org/html/2605.23191#bib.bib41), [a](https://arxiv.org/html/2605.23191#bib.bib42); Guo et al., [2024b](https://arxiv.org/html/2605.23191#bib.bib9); Wang et al., [2025a](https://arxiv.org/html/2605.23191#bib.bib36)). Among recent scaling-oriented architectures, RankMixer (Zhu et al., [2025](https://arxiv.org/html/2605.23191#bib.bib46)) has attracted significant attention. The RankMixer architecture can be understood as three core components: (i) a *tokenization* module that maps heterogeneous embeddings from multiple fields into a unified token representation space; (ii) a *token mixing* module that augments token representations by combining the original token matrix with its block-transposed view; and (iii) a *per-token FFN* (P-FFN) module that models feature interactions using independent feedforward networks applied to each token (Kira and Rendell, [1992](https://arxiv.org/html/2605.23191#bib.bib19); Batory et al., [2011](https://arxiv.org/html/2605.23191#bib.bib5)). By iteratively alternating the token mixing and P-FFN modules, RankMixer forms a deep and scalable recommender architecture that achieves state-of-the-art performance on multiple benchmarks as well as in practical deployments.

Despite its empirical success, RankMixer has not been examined from the perspective of *embedding collapse* (Guo et al., [2024a](https://arxiv.org/html/2605.23191#bib.bib10)). Embedding collapse has recently been identified as a fundamental obstacle in scaling recommender systems (Guo et al., [2024a](https://arxiv.org/html/2605.23191#bib.bib10); Pan et al., [2024](https://arxiv.org/html/2605.23191#bib.bib25); Chen et al., [2024](https://arxiv.org/html/2605.23191#bib.bib6); Zhang et al., [2025](https://arxiv.org/html/2605.23191#bib.bib44); Yin et al., [2025](https://arxiv.org/html/2605.23191#bib.bib40)), where learned representations concentrate in low-rank subspaces, limiting representation diversity and reducing the benefits of model scaling. While prior work has attempted to mitigate collapse via architectural modifications (Guo et al., [2024a](https://arxiv.org/html/2605.23191#bib.bib10)) or optimization strategies (Lin et al., [2025](https://arxiv.org/html/2605.23191#bib.bib22); Wang et al., [2025b](https://arxiv.org/html/2605.23191#bib.bib37)), these solutions are typically designed for specific recommenders. A systematic understanding of embedding collapse in RankMixer remains missing.

In this paper, we investigate RankMixer through the lens of *effective rank* (Roy and Vetterli, [2007](https://arxiv.org/html/2605.23191#bib.bib29); Vershynin, [2011](https://arxiv.org/html/2605.23191#bib.bib35)), a spectral measure that captures representation capacity beyond algebraic rank and generalizes prior collapse analysis tools in recommenders (Guo et al., [2024a](https://arxiv.org/html/2605.23191#bib.bib10); Yin et al., [2025](https://arxiv.org/html/2605.23191#bib.bib40)). Empirically, we observe that RankMixer exhibits a distinctive *damped oscillatory trajectory* in rank evolution across layers: token-mixing modules slightly increase effective rank, while P-FFN modules contract it. Although this alternating behavior yields modest improvements over conventional recommenders (e.g., DCNv2 (Wang et al., [2021](https://arxiv.org/html/2605.23191#bib.bib38))), which typically show monotonic rank decay, the gains remain limited and do not reliably prevent representation collapse as model depth increases. Complementing these observations, our theoretical analysis shows that the original token-mixing and P-FFN modules in RankMixer collectively constrain spectral robustness, explaining why collapse mitigation remains incomplete.

To tackle this drawback and unlock the potential of token-transformation-based recommenders, we propose RankElastor, a novel architecture designed to produce spectrum-robust representations with provable collapse mitigation. RankElastor introduces two key components: (i) *parameterized full mixing*, which performs learnable fine-grained mixing over tokens to improve spectral expressiveness; and (ii) *GLU-improved P-FFNs*, which employ gated activations (Shazeer, [2020](https://arxiv.org/html/2605.23191#bib.bib30)) to prevent the collapse amplification induced by the original RankMixer P-FFNs. Together, these modules produce more stable representation spectra and improved scaling behavior.

We evaluate RankElastor against strong recommender baselines, including RankMixer, on the industrial-scale benchmarks Criteo (Jean-Baptiste Tien, [2014](https://arxiv.org/html/2605.23191#bib.bib15)) and Avazu (Steve Wang, [2014](https://arxiv.org/html/2605.23191#bib.bib33)), enabling a practically aligned comparison. Experimental results show that RankElastor consistently improves downstream CTR prediction performance, achieving over 0.001 AUC gain against the strongest baseline, a notable improvement according to Zhou et al. ([2018](https://arxiv.org/html/2605.23191#bib.bib45)). Beyond accuracy improvements, RankElastor produces representations with markedly higher effective rank, indicating stronger mitigation of representation collapse. In addition, RankElastor exhibits markedly better parameter scaling behavior than RankMixer, confirming its advantage as a scalable recommender architecture. Our contributions are summarized below:

- •We analyze RankMixer from the perspective of embedding collapse, providing both empirical evidence and theoretical justification showing that collapse remains insufficiently addressed in the architecture\.
- •We propose RankElastor , a novel deep recommender architecture that mitigates representation collapse through parameterized full mixing and GLU\-improved P\-FFNs, with theoretical guarantees on spectral robustness\.
- •We conduct extensive experiments on industrial\-scale benchmarks demonstrating that RankElastor improves recommendation performance, representation diversity, and parameter scaling behavior, providing a new direction for scaling recommender architectures\.

## 2\.Research Background

### 2\.1\.Notations and Preliminaries

Recommendation models aim to predict user actions based on features drawn from multiple fields (Zhang et al., [2016](https://arxiv.org/html/2605.23191#bib.bib43)). In line with the application scenario considered in this paper, namely CTR/ranking prediction (McMahan et al., [2013](https://arxiv.org/html/2605.23191#bib.bib24)), we consider $n$ fields, where the $i$-th field is denoted as $\mathcal{X}_{i}$, and define the joint feature space as $\mathcal{X}=\mathcal{X}_{1}\times\mathcal{X}_{2}\times\dots\times\mathcal{X}_{n}$. Let $\mathcal{Y}$ denote the prediction space; the goal of a recommendation model is to learn a mapping from $\mathcal{X}$ to $\mathcal{Y}$.

We focus on recommenders following the “embedding-interaction” architecture. Such models first transform an input sample $X\in\mathcal{X}$ into a *feature embedding* $E\in\mathbb{R}^{n\times k}$, where $k$ is the embedding dimension and the $i$-th row $E_{i}$ denotes the embedding vector for field $i$. The embedding matrix is then processed by a *feature interaction* module (Kira and Rendell, [1992](https://arxiv.org/html/2605.23191#bib.bib19); Batory et al., [2011](https://arxiv.org/html/2605.23191#bib.bib5)), which models correlations across fields and produces informative representations for prediction. This architecture is widely adopted in real-world recommender systems due to its strong empirical performance (Wang et al., [2021](https://arxiv.org/html/2605.23191#bib.bib38); Lian et al., [2018](https://arxiv.org/html/2605.23191#bib.bib21); Song et al., [2019](https://arxiv.org/html/2605.23191#bib.bib31); Rendle, [2010](https://arxiv.org/html/2605.23191#bib.bib27)).

#### Embedding Collapse\(Guo et al\.,[2024a](https://arxiv.org/html/2605.23191#bib.bib10)\)\.

In general machine learning settings, dimensional collapse refers to models degenerating into trivial representations that map inputs to nearly constant outputs (Hua et al., [2021](https://arxiv.org/html/2605.23191#bib.bib14)), which can be characterized through spectral analysis of learned representations (Jing et al., [2022](https://arxiv.org/html/2605.23191#bib.bib16)). In recommender systems, a related phenomenon known as *embedding collapse* has recently been identified: the embedding matrix becomes approximately low-rank with multiple near-zero singular values (Guo et al., [2024a](https://arxiv.org/html/2605.23191#bib.bib10); Pan et al., [2024](https://arxiv.org/html/2605.23191#bib.bib25); Chen et al., [2024](https://arxiv.org/html/2605.23191#bib.bib6); Zhang et al., [2025](https://arxiv.org/html/2605.23191#bib.bib44)).

This phenomenon is commonly measured using *effective rank* (Roy and Vetterli, [2007](https://arxiv.org/html/2605.23191#bib.bib29)), which characterizes the spectral distribution of a matrix beyond algebraic rank. Among several formulations, we adopt a widely used norm-based definition (Vershynin, [2011](https://arxiv.org/html/2605.23191#bib.bib35); Balduzzi et al., [2017](https://arxiv.org/html/2605.23191#bib.bib3); Bartlett et al., [2020](https://arxiv.org/html/2605.23191#bib.bib4)), also known as the stable rank (Cohen et al., [2016](https://arxiv.org/html/2605.23191#bib.bib7)):

$$\operatorname{erank}(X)=\frac{\sum_{i}\sigma_{i}^{2}}{\max_{i}\sigma_{i}^{2}}=\frac{\|X\|_{F}^{2}}{\|X\|_{2}^{2}}, \tag{1}$$

where $\sigma_{i}$ denotes the singular values of $X$.
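As a concrete reading of Eq. (1), the following is a minimal NumPy sketch of the stable-rank computation. The function name `effective_rank` and the convention of returning 0 for the zero matrix are illustrative assumptions, not taken from the paper or its repository.

```python
import numpy as np


def effective_rank(x: np.ndarray) -> float:
    """Stable rank of Eq. (1): squared Frobenius norm over squared spectral norm."""
    sigma = np.linalg.svd(x, compute_uv=False)  # singular values, in descending order
    top = sigma[0] ** 2
    if top == 0.0:
        return 0.0  # assumed convention for the all-zero matrix
    return float(np.sum(sigma ** 2) / top)


# Equal singular values: the effective rank matches the algebraic rank.
print(effective_rank(np.eye(4)))
# One dominant direction: the effective rank stays close to 1.
print(effective_rank(np.outer(np.arange(1.0, 5.0), np.ones(3))))
```

Because the measure depends on the ratio of singular values rather than on how many are nonzero, a matrix can have full algebraic rank and still score close to 1 when one direction dominates.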

#### RankMixer\(Zhu et al\.,[2025](https://arxiv.org/html/2605.23191#bib.bib46)\)\.

As a recent deep recommender architecture, RankMixer has attracted substantial attention due to its streamlined design, strong empirical performance, and favorable scaling behavior\. Its architecture consists of three modules: \(i\)tokenization; \(ii\)token mixing; and \(iii\)per\-token FFNs\(P\-FFNs\)\. We briefly summarize them below:

- **Tokenization.** The tokenization module takes the embedding matrix $E$ and groups embedding vectors into $T$ clusters, followed by a unified projection (e.g., linear mapping or MLP) into a shared $D$-dimensional space, producing a token matrix $X^{(0)}\in\mathbb{R}^{T\times D}$. Each row $X^{(0)}_{t}\in\mathbb{R}^{D}$ is called a token embedding. This module is primarily used for aligning embeddings from heterogeneous sources. In this paper, we focus on the subsequent core modules.
- **Token mixing.** The token mixing module partitions the column dimension $D$ into $H$ equal segments of size $d=D/H$; with $H=T$ this forms a $T\times T$ grid of blocks, each in $\mathbb{R}^{1\times d}$. At the $l$-th RankMixer block, a block-transpose operation is first applied to the input $X^{(l)}$, exchanging block pairs $X^{(l)}_{i,j}$ and $X^{(l)}_{j,i}$. The transformed representation is then combined with the original token matrix through a residual connection, followed by output layer normalization, producing enhanced token representations with cross-token information mixing.
- **P-FFNs.** RankMixer then sends the mixed tokens into token-specific FFN modules, called P-FFNs. This module performs feature interaction, as in typical “embedding-interaction” recommenders. In RankMixer, the P-FFN is configured as a two-layer network with GELU (Hendrycks and Gimpel, [2023](https://arxiv.org/html/2605.23191#bib.bib12)) as the activation function.
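The token-mixing step described above can be sketched in a few lines of NumPy. This is an illustrative reading of the description, not the RankMixer implementation: the function names are hypothetical, the layer normalization is assumed to act per token over the $D$ coordinates, and its learnable scale and shift are omitted.

```python
import numpy as np


def block_transpose(x: np.ndarray) -> np.ndarray:
    """Swap blocks X_{i,j} and X_{j,i} of a T x D matrix viewed as a T x T grid of 1 x d blocks."""
    t, dim = x.shape
    if dim % t != 0:
        raise ValueError("D must be divisible by T")
    d = dim // t
    blocks = x.reshape(t, t, d)  # blocks[i, j] is the 1 x d block X_{i,j}
    return blocks.transpose(1, 0, 2).reshape(t, dim)


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-token normalization without affine parameters (an assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def token_mixing(x: np.ndarray) -> np.ndarray:
    """Residual combination of X with its block transpose, followed by layer norm."""
    return layer_norm(x + block_transpose(x))


x = np.arange(12.0).reshape(2, 6)  # T = 2 tokens, D = 6, so d = 3
print(block_transpose(x))
```

The operation has no trainable parameters: the only freedom is the fixed block permutation, which is the rigidity the later analysis attributes to this module.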

Stacking token mixing and P\-FFN modules yields a deep and scalable recommender architecture\. An overview of the RankMixer architecture is shown in Figure[3](https://arxiv.org/html/2605.23191#S3.F3)\(a\)\.

![Refer to caption](https://arxiv.org/html/2605.23191v1/x1.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x2.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x4.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x5.png) (a) Raw Embedding → Mixing 1
![Refer to caption](https://arxiv.org/html/2605.23191v1/x6.png) (b) Mixing 1 → FFN 1
![Refer to caption](https://arxiv.org/html/2605.23191v1/x7.png) (c) FFN 1 → Mixing 2
![Refer to caption](https://arxiv.org/html/2605.23191v1/x8.png) (d) Mixing 2 → FFN 2

Figure 1. Distributional shift of effective rank across representation stages in RankMixer. Distributions of per-sample effective rank from raw embeddings to successive module outputs on Criteo (top) and Avazu (bottom).

![Refer to caption](https://arxiv.org/html/2605.23191v1/x9.png) (a) Comparison on Criteo.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x10.png) (b) Comparison on Avazu.

Figure 2. Comparison of average effective rank across representation stages for RankMixer (damped oscillatory trajectory) and baselines (monotonic decay).

### 2\.2\.On Embedding Collapse in RankMixer

Although RankMixer demonstrates strong scalability and effectiveness, its behavior under embedding collapse has not been systematically studied\. Existing collapse analyses primarily focus on conventional recommenders\(Guo et al\.,[2024a](https://arxiv.org/html/2605.23191#bib.bib10); Yin et al\.,[2025](https://arxiv.org/html/2605.23191#bib.bib40)\), such as DCNv2\(Wang et al\.,[2021](https://arxiv.org/html/2605.23191#bib.bib38)\)and xDeepFM\(Lian et al\.,[2018](https://arxiv.org/html/2605.23191#bib.bib21)\)\. In this section, we analyze the effective\-rank dynamics of RankMixer representations across its core modules \(token mixing and P\-FFNs\), combining empirical observations with theoretical justification\.

#### Empirical Demonstrations\.

We conduct CTR prediction experiments on two industrial\-scale benchmarks, Criteo\(Jean\-Baptiste Tien,[2014](https://arxiv.org/html/2605.23191#bib.bib15)\)and Avazu\(Steve Wang,[2014](https://arxiv.org/html/2605.23191#bib.bib33)\), using the FuxiCTR framework\(Zhu et al\.,[2021](https://arxiv.org/html/2605.23191#bib.bib47)\)\. We compare RankMixer \(with two token\-mixing blocks and two P\-FFN blocks\) against DCNv2 and xDeepFM configured with comparable depth, resulting in four\-layer recommenders for fair comparison\. We then visualize the average effective rank of module outputs across test samples\. All remaining experimental settings follow those described in Section[4](https://arxiv.org/html/2605.23191#S4)\.

While full recommendation performance results are reported in Section [4](https://arxiv.org/html/2605.23191#S4), we focus here on effective-rank dynamics. As shown in Figure [2](https://arxiv.org/html/2605.23191#S2.F2), conventional models exhibit a *monotonic decay* in effective rank with depth; in contrast, RankMixer shows an alternating pattern: token mixing increases effective rank, whereas P-FFNs shrink it, producing a *damped oscillatory trajectory*. Although this sawtooth pattern allows RankMixer to maintain slightly higher effective rank than conventional models, the improvement remains limited and does not eliminate collapse.

On Criteo, the effective rank at the final layer only marginally exceeds that of the raw embeddings, indicating that the rank\-enhancement effect of token mixing is modest in practice\. This limitation becomes more pronounced on Avazu \(Figure[2\(b\)](https://arxiv.org/html/2605.23191#S2.F2.sf2)\), where contractions introduced by P\-FFNs dominate across layers, causing collapse to re\-emerge despite the presence of token mixing\.

Since higher effective rank typically indicates better utilization of representation capacity and correlates with improved recommendation performance\(Guo et al\.,[2024a](https://arxiv.org/html/2605.23191#bib.bib10); Yin et al\.,[2025](https://arxiv.org/html/2605.23191#bib.bib40)\), these results reveal both the advantage and limitation of RankMixer from the perspective of embedding collapse: although the architecture improves representation rank compared with conventional recommenders, it does not fundamentally prevent collapse\.

#### Theoretical Justifications\.

We further complement the empirical analysis with theoretical justification\. Both the effective\-rank improvement and the remaining collapse limitation of RankMixer can be traced to its two modules: token mixing and P\-FFNs\. The following theorems provide theoretical justification for these effects\.

###### Theorem 2\.1 \(Effective rank under block\-transpose mixing\)\.

Let $X\in\mathbb{R}^{T\times D}$ be a representation matrix with $D=Td$ and $d\geq 1$. Partition $X$ into a $T\times T$ grid of blocks $X_{ij}\in\mathbb{R}^{1\times d}$. Define the block-transpose operator $\mathcal{T}$ by $(\mathcal{T}(X))_{ij}=X_{ji}$, and let $Y=\mathcal{T}(X)$. Denote the algebraic rank and effective rank of $X$ by $\operatorname{rank}(X)=r$ and $\operatorname{erank}(X)=k$. Assume $X$ satisfies Frobenius-orthogonality and spectral incoherence with its block transpose:

$$\langle X,Y\rangle_{F}\approx 0\quad\text{and}\quad\|X+Y\|_{2}^{2}\approx\max\big(\|X\|_{2}^{2},\|Y\|_{2}^{2}\big).$$

Let $M=X+Y$ and define $\mu=\operatorname{erank}(Y)$. Then

$$\frac{2k\mu}{(\sqrt{k}+\sqrt{\mu})^{2}}\leq\operatorname{erank}(M)\leq 2(k+\mu),$$

where $\mu\leq\operatorname{rank}(Y)\leq\min\{T,rd\}$.

###### Theorem 2\.2 \(Effective rank failure of standard FFNs\)\.

Let $X\in\mathbb{R}^{T\times D}$ be a representation matrix with effective rank $\operatorname{erank}(X)=k$ and $\frac{k}{D}\leq\gamma$ for some fixed constant $\gamma\in(0,1)$ independent of $D$. Consider the per-row feedforward network $\mathcal{F}(X)=\phi(XA)B$, where $A\in\mathbb{R}^{D\times m}$ and $B\in\mathbb{R}^{m\times D}$ have i.i.d. sub-Gaussian entries with variance $1/D$, and $\phi$ is positively homogeneous. Then:

- (Deterministic collapse) If $X$ has algebraic rank $1$, then $\operatorname{rank}(\mathcal{F}(X))\leq 2$ and $\operatorname{erank}(\mathcal{F}(X))\leq 2$, with both quantities equal to $1$ when all rows of $X$ have the same sign.
- (Probabilistic contraction) If the pre-activations satisfy a nontrivial response-gap condition (see Appendix [B](https://arxiv.org/html/2605.23191#A2)), then with probability at least $1-\exp(-ck)$, $\operatorname{erank}(\mathcal{F}(X))\leq\alpha\,\operatorname{erank}(X)$ for some constant $\alpha\in(0,1)$ depending only on $\gamma$ and the activation $\phi$.
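The deterministic-collapse case can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's experiment: it uses ReLU as the positively homogeneous activation, Gaussian weights with variance $1/D$ (one sub-Gaussian choice), arbitrary small sizes, and a hypothetical helper `numerical_rank` with a relative tolerance.

```python
import numpy as np


def numerical_rank(mat: np.ndarray, rtol: float = 1e-8) -> int:
    """Count singular values above rtol times the largest one."""
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s > rtol * s[0]))


def relu_ffn(x: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """F(X) = phi(XA)B with phi = ReLU, which is positively homogeneous."""
    return np.maximum(x @ a, 0.0) @ b


rng = np.random.default_rng(0)
T, D, m = 16, 32, 64  # illustrative sizes, not from the paper
A = rng.standard_normal((D, m)) / np.sqrt(D)  # Gaussian entries with variance 1/D
B = rng.standard_normal((m, D)) / np.sqrt(D)

u = np.linspace(-1.0, 1.0, T).reshape(T, 1)  # row scales with mixed signs, none zero
v = rng.standard_normal((1, D))
X_mixed = u @ v          # algebraic rank 1, rows point in both directions
X_same = np.abs(u) @ v   # algebraic rank 1, every row is a positive multiple of v

rank_mixed = numerical_rank(relu_ffn(X_mixed, A, B))
rank_same = numerical_rank(relu_ffn(X_same, A, B))
print(rank_mixed, rank_same)
```

The reason is visible in the code: each row of a rank-1 input is a scalar multiple of `v`, so positive homogeneity leaves only two possible output directions, one for positive and one for negative multiples.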

Theorem [2.1](https://arxiv.org/html/2605.23191#S2.Thmtheorem1) explains why token mixing can increase effective rank. The block-transpose mixing produces a representation whose effective rank admits the lower bound $\frac{2k\mu}{(\sqrt{k}+\sqrt{\mu})^{2}}$, showing that mixing introduces a rank-expansion effect. However, the theorem also reveals an intrinsic limitation: the achievable improvement is bounded by $\operatorname{erank}(X+Y)\leq 2(k+\mu)$. Since recommender embeddings are often already spectrally collapsed at initialization stages (Guo et al., [2024a](https://arxiv.org/html/2605.23191#bib.bib10); Pan et al., [2024](https://arxiv.org/html/2605.23191#bib.bib25); Kang et al., [2026](https://arxiv.org/html/2605.23191#bib.bib17)), this bounded expansion typically results in only modest rank improvement in practice. This explains the small rank increase observed after token mixing in Figures [1(a)](https://arxiv.org/html/2605.23191#S2.F1.sf1) and [1(c)](https://arxiv.org/html/2605.23191#S2.F1.sf3).

Theorem [2.2](https://arxiv.org/html/2605.23191#S2.Thmtheorem2) characterizes the complementary effect of P-FFNs. When the input representation is already low-rank, the FFN module can further contract effective rank, either deterministically in the rank-1 case or probabilistically in the generic rank-$k$ case. This explains the repeated effective-rank decreases observed in Figures [1(b)](https://arxiv.org/html/2605.23191#S2.F1.sf2) and [1(d)](https://arxiv.org/html/2605.23191#S2.F1.sf4).

#### Summary\.

Taken together, the empirical and theoretical analyses reveal a consistent picture of spectral dynamics in RankMixer. Although RankMixer maintains higher effective rank than conventional recommenders, the improvement is inherently limited and does not reliably prevent collapse from re-emerging across layers. This limitation arises from the architecture itself: token mixing provides bounded rank expansion, while P-FFNs exhibit rank-contractive behavior, jointly producing the characteristic damped oscillatory trajectory across depth.

Importantly, this diagnosis suggests a principled direction for architectural refinement: *Expand More and Shrink Less*, that is, enabling token mixing to produce stronger spectrum expansion while making P-FFNs less spectrum-reductive. Motivated by this insight, we propose RankElastor in the next section.

## 3\.Method: RankElastor

![Refer to caption](https://arxiv.org/html/2605.23191v1/x11.png)

Figure 3. Architectural overview of (a) RankMixer and (b) RankElastor. Purple modules indicate trainable components, while gray modules indicate non-parametric components. The key difference between the models lies in token mixing and P-FFNs.

In this section, we present RankElastor, a novel deep recommender architecture that generalizes the token-mixing and P-FFN modules of RankMixer by establishing collapse-robust token transformations. We focus on two core architectural components in RankElastor: (i) *Parameterized Full Token Mixing* and (ii) *GLU-improved P-FFNs*, both supported by theoretical justification. An overview of RankElastor is illustrated in Figure [3](https://arxiv.org/html/2605.23191#S3.F3)(b).

### 3\.1\.Parameterized Full Token Mixing

In RankMixer, token mixing is implemented as a non-parametric block-transpose operation. This operation is equivalent to

$$\operatorname{vec}(M^{\top})=\operatorname{LN}\big((P\otimes I)\operatorname{vec}(X^{\top})+\operatorname{vec}(X^{\top})\big), \tag{2}$$

where $P\in\mathbb{R}^{T^{2}\times T^{2}}$ is a permutation operator structured as a *commutation matrix* (Magnus and Neudecker, [1979](https://arxiv.org/html/2605.23191#bib.bib23)), $I\in\mathbb{R}^{\frac{D}{T}\times\frac{D}{T}}$ denotes the identity matrix, $\operatorname{vec}$ denotes the *vec operator* (Henderson and Searle, [1981](https://arxiv.org/html/2605.23191#bib.bib11)), $\operatorname{LN}$ denotes layer normalization, and $\otimes$ is the *Kronecker product* (Horn and Johnson, [2012](https://arxiv.org/html/2605.23191#bib.bib13)).

This representation reveals an important structural property: token mixing is governed by the permutation operator $P$ and the grid scale $d=\frac{D}{T}$. Modifying either the permutation operator or the grid scale leads to different token-mixing behaviors. Motivated by this observation, we generalize Eq. [2](https://arxiv.org/html/2605.23191#S3.E2) by introducing a learnable mixing operator and reducing the grid scale to the finest resolution. This leads to the proposed *parameterized full mixing*:

$$\operatorname{vec}(M^{\top})=\operatorname{LN}\big((W+I)\operatorname{vec}(X^{\top})\big), \tag{3}$$

where $W\in\mathbb{R}^{TD\times TD}$ is a learnable mixing matrix and $I$ preserves residual information flow.

This formulation corresponds to the fine-grained case $d=1$ with fully parameterized mixing weights, enabling interactions across all token–feature coordinates. Such generalization strengthens the token-mixing module by allowing the model to refine representations that may otherwise lead to spectral collapse under restricted block-wise transformations. We formalize this expressivity advantage in the following theorem.
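The relation between the block-transpose form and the parameterized form can be illustrated with a small NumPy sketch. It is a hedged reading of the two mixing equations, not the released RankElastor code: the function names are hypothetical, $W$ is stored as a dense matrix, and layer normalization is assumed to act per token without affine parameters.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-token normalization without affine parameters (an assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def full_mixing(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Compute LN((W + I) vec(X^T)) and reshape the result back to T x D."""
    t, dim = x.shape
    n = t * dim
    if w.shape != (n, n):
        raise ValueError("W must have shape (T*D, T*D)")
    flat = x.reshape(n)        # vec(X^T): the rows of X stacked into one vector
    mixed = flat + w @ flat    # (W + I) vec(X^T)
    return layer_norm(mixed.reshape(t, dim))


def block_transpose_matrix(t: int, dim: int) -> np.ndarray:
    """Permutation matrix that reproduces the block transpose on vec(X^T)."""
    if dim % t != 0:
        raise ValueError("D must be divisible by T")
    n, d = t * dim, dim // t
    source = np.arange(n).reshape(t, t, d).transpose(1, 0, 2).reshape(n)
    p = np.zeros((n, n))
    p[np.arange(n), source] = 1.0  # output coordinate r reads input coordinate source[r]
    return p


x = np.random.default_rng(0).standard_normal((2, 6))
same_as_rankmixer = full_mixing(x, block_transpose_matrix(2, 6))
print(same_as_rankmixer.shape)
```

Setting `w` to the fixed permutation recovers the non-parametric mixing, which shows the learnable form contains it as a special case. A dense $W$ has $(TD)^2$ entries, so this literal form is only practical for small $T$ and $D$.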

###### Theorem 3\.1 \(Expressivity of parameterized full mixing\)\.

Let $X\in\mathbb{R}^{T\times D}$ be an input representation matrix and $x=\operatorname{vec}(X)\in\mathbb{R}^{N}$ with $N=TD$. For a divisor $d^{\ast}$ of $N$, partition $x$ into $K=N/d^{\ast}$ blocks $\mathcal{B}=\{\mathbf{b}_{1},\dots,\mathbf{b}_{K}\}$, where $\mathbf{b}_{k}\in\mathbb{R}^{d^{\ast}}$. A parameterized block-mixing scheme with learnable weights $\mathbf{W}\in\mathbb{R}^{K\times K}$ produces $\mathbf{y}=(\mathbf{W}\otimes\mathbf{I}_{d^{\ast}})\,\mathbf{x}$ and the residual output

$$\mathbf{C}=\Phi_{d^{\ast}}(\mathbf{X};\mathbf{W})\triangleq\mathbf{X}+\operatorname{reshape}\!\left((\mathbf{W}\otimes\mathbf{I}_{d^{\ast}})\operatorname{vec}(\mathbf{X})\right).$$

Then for any $d^{\ast}>1$, the set of linear transformations realizable by $\Phi_{d^{\ast}}$ is strictly contained in that of the parameterized full mixing case ($d^{\ast}=1$). In particular, the Kronecker constraint $(\mathbf{W}\otimes\mathbf{I}_{d^{\ast}})$ prevents expressing fine-grained, high-rank coordinate interactions that are attainable when $d^{\ast}=1$.

The theorem’s justification is presented in Appendix [C](https://arxiv.org/html/2605.23191#A3). The theorem shows that when the grid scale satisfies $d^{\ast}>1$, there always exist nontrivial inputs whose token-mixing outputs remain spectrally constrained. In contrast, the fine-grained case $d^{\ast}=1$, corresponding to our parameterized full mixing, removes the Kronecker constraint and enables richer coordinate interactions, improving robustness against representational collapse.
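A parameter count gives some intuition for the strict containment. This is a sketch for illustration only, not the argument in the paper's appendix, and the $N=4$ example is our own choice. The Kronecker-constrained operators form a linear subspace of dimension $K^{2}$ inside the $N^{2}$-dimensional space of all $N\times N$ matrices:

$$\dim\{\mathbf{W}\otimes\mathbf{I}_{d^{\ast}} : \mathbf{W}\in\mathbb{R}^{K\times K}\}=K^{2}=\Big(\frac{N}{d^{\ast}}\Big)^{2}<N^{2}\quad\text{for } d^{\ast}>1.$$

For instance, with $N=4$ and $d^{\ast}=2$ (so $K=2$),

$$\begin{pmatrix} a & b\\ c & d\end{pmatrix}\otimes\mathbf{I}_{2}=\begin{pmatrix} a & 0 & b & 0\\ 0 & a & 0 & b\\ c & 0 & d & 0\\ 0 & c & 0 & d\end{pmatrix},$$

so the entry in position $(1,2)$ is always zero and no choice of $\mathbf{W}$ can couple the first two coordinates. Adding the residual identity does not help, since $\mathbf{I}_{N}+\mathbf{W}\otimes\mathbf{I}_{d^{\ast}}=(\mathbf{I}_{K}+\mathbf{W})\otimes\mathbf{I}_{d^{\ast}}$ keeps the same pattern, whereas for $d^{\ast}=1$ the operator $\mathbf{I}_{N}+\mathbf{W}$ ranges over all $N\times N$ matrices.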

### 3\.2\.GLU\-improved P\-FFNs

Although parameterized full mixing improves the expressivity of token representations at the matrix level (in terms of effective rank), its collapse mitigation effect is primarily statistical rather than deterministic. In practical recommendation scenarios, highly skewed or degenerate inputs may still lead to degraded representations. To further improve robustness, we refine the second core module of RankMixer, namely the P-FFN.

We replace the GELU\-based FFN with a GLU\-style gated feed\-forward module \(bias omitted for clarity\):

$$Z_{t}=\big(\operatorname{GELU}(M_{t}W_{1})\odot(M_{t}W_{2})\big)W_{3}+M_{t}W_{r}, \tag{4}$$

where $W_{1},W_{2}\in\mathbb{R}^{D\times rD}$ are lifting projections, $W_{3}\in\mathbb{R}^{rD\times D}$ is the compression projection, $r$ is the expansion ratio, $W_{r}$ is a learnable residual mapping, and $\odot$ denotes the Hadamard product (Strang, [2022](https://arxiv.org/html/2605.23191#bib.bib34)). This design follows the gated activation principle of GLU (Shazeer, [2020](https://arxiv.org/html/2605.23191#bib.bib30)), which has been shown to improve expressivity and representation quality in modern deep architectures (Song et al., [2024](https://arxiv.org/html/2605.23191#bib.bib32); Yang et al., [2024](https://arxiv.org/html/2605.23191#bib.bib39); Fishman et al., [2025](https://arxiv.org/html/2605.23191#bib.bib8)).
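A minimal NumPy sketch of the forward pass in Eq. (4) follows. Several details are assumptions made for illustration: the weights are stored with a leading token axis (one weight set per token, as the name "per-token FFN" suggests, although Eq. (4) writes them without a token index), $W_{r}$ is taken to be $D\times D$, and GELU uses the common tanh approximation. The function names are hypothetical and do not come from the released code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact erf form also works)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu_ffn_token(m, W1, W2, W3, Wr):
    """Eq. (4) for a single token vector m of shape (D,), biases omitted."""
    return (gelu(m @ W1) * (m @ W2)) @ W3 + m @ Wr

def glu_pffn(M, W1, W2, W3, Wr):
    """Batched version over tokens.

    M: (T, D); W1, W2: (T, D, r*D); W3: (T, r*D, D); Wr: (T, D, D).
    """
    gate = gelu(np.einsum("td,tdh->th", M, W1))     # gating branch
    value = np.einsum("td,tdh->th", M, W2)          # linear branch
    out = np.einsum("th,thd->td", gate * value, W3) # compress back to D
    return out + np.einsum("td,tde->te", M, Wr)     # learnable residual path

rng = np.random.default_rng(0)
T, D, r = 5, 8, 3                                   # toy sizes for illustration
M = rng.standard_normal((T, D))
W1 = rng.standard_normal((T, D, r * D)) / np.sqrt(D)
W2 = rng.standard_normal((T, D, r * D)) / np.sqrt(D)
W3 = rng.standard_normal((T, r * D, D)) / np.sqrt(r * D)
Wr = rng.standard_normal((T, D, D)) / np.sqrt(D)

Z = glu_pffn(M, W1, W2, W3, Wr)
```

The product `gate * value` is where the degree-2 interaction between input coordinates enters; when the gate is zero the module reduces to the residual map $M_{t}W_{r}$.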

We next show that the GLU-improved P-FFN provides stronger effective-rank recovery than the GELU-based FFN used in RankMixer.

###### Theorem 3\.2 \(Effective rank recovery via GLU\-improved P\-FFNs\)\.

Let $X\in\mathbb{R}^{T\times D}$ satisfy $\operatorname{erank}(X)=k$ with $k/D\leq\gamma$ for a fixed constant $\gamma\in(0,1)$. Consider the GLU-improved P-FFN with residual

$$\mathcal{G}(X)=\big(\phi(XA)\odot(XC)\big)B+XD,$$

where $A,C\in\mathbb{R}^{D\times m}$ and $B\in\mathbb{R}^{m\times D}$ have i.i.d. sub-Gaussian entries with variance $1/D$, and $\odot$ denotes the Hadamard product. If the hidden width satisfies $m\geq Ck\log D$, then with probability at least $1-\exp(-ck)$:

- (Algebraic lifting) The multiplicative term induces degree-2 interactions, and $\operatorname{rank}\!\big(\phi(XA)\odot(XC)\big)\;\geq\;\min\!\left(D,\frac{k(k+1)}{2}\right)$.
- (Effective rank increase) The output satisfies $\operatorname{erank}(\mathcal{G}(X))\geq\operatorname{erank}(X)+\delta$ for some $\delta>0$ depending only on $\gamma$ and initialization constants.

The theorem's justification is presented in Appendix [D](https://arxiv.org/html/2605.23191#A4). This result shows that GLU-improved P-FFNs can statistically maintain spectrum robustness through multiplicative feature interaction, whereas conventional activation-based FFNs may still exhibit nontrivial failure cases that lead to collapse in effective rank. Consequently, the proposed P-FFN improves robustness against representational collapse while maintaining the standard feed-forward computation structure used in RankMixer.
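The algebraic-lifting claim can be checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions, not a reproduction of the proof: it uses exact matrix rank rather than effective rank, takes $\phi$ to be the identity to isolate the multiplicative effect, draws Gaussian weights, and uses toy sizes chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, m, k = 64, 32, 48, 3            # toy sizes; k is the input rank

# Build an exactly rank-k input X = U V.
U = rng.standard_normal((T, k))
V = rng.standard_normal((k, D))
X = U @ V

# Random lifting projections with variance 1/D, as in the theorem statement.
A = rng.standard_normal((D, m)) / np.sqrt(D)
C = rng.standard_normal((D, m)) / np.sqrt(D)

linear_branch = X @ A                 # a purely linear lift cannot exceed rank k
gated = (X @ A) * (X @ C)             # Hadamard product with phi = identity (assumption)

rank_linear = np.linalg.matrix_rank(linear_branch, tol=1e-8)
rank_gated = np.linalg.matrix_rank(gated, tol=1e-8)
bound = min(D, k * (k + 1) // 2)
```

Each entry of `gated` is a quadratic form in the $k$ latent coordinates of a row of `X`, so its rows span the space of degree-2 monomials, which has dimension $k(k+1)/2$. For generic weights the rank therefore rises from $k$ to $\min(D, k(k+1)/2)$, matching the bound in the first bullet of Theorem 3.2.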

### 3\.3\.Complexity Analysis

We compare RankMixer and RankElastor via the computational and parameter complexity of token mixing and P\-FFN modules\.

#### On token mixing\.

The token mixing module in RankMixer is implemented as a block-transpose permutation, which incurs $\mathcal{O}(TD)$ computation and introduces no additional parameters. In contrast, RankElastor employs Parameterized Full Mixing, which applies a dense linear transformation over the flattened representation. This results in $\mathcal{O}(T^{2}D^{2})$ computation and $\mathcal{O}(T^{2}D^{2})$ parameters.

#### On P\-FFN\.

The P-FFN module in RankMixer has computational complexity $\mathcal{O}(TrD^{2})$ and parameter complexity $\mathcal{O}(rD^{2})$. RankElastor instead adopts a GLU-improved P-FFN, which increases the feed-forward cost to $\mathcal{O}\big(T(3rD^{2}+D^{2})\big)$ with parameter complexity $(3r+1)D^{2}$. This is a constant-factor increase over the P-FFN in RankMixer.

Overall, the primary complexity difference between RankMixer and RankElastor arises from the token\-mixing module, where RankElastor introduces additional parameters and computation through Parameterized Full Mixing\. The GLU\-improved P\-FFN contributes a smaller, constant\-factor increase in complexity\.
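To give a sense of scale, the leading-order counts above can be evaluated for the $(T,D)$ settings reported in Section 4.1. This is a back-of-the-envelope sketch: the helper names are hypothetical, biases and any per-token replication of FFN weights are ignored, and the numbers follow from the stated formulas rather than from measured model sizes.

```python
def mixing_params(T, D, d_star=1):
    """Leading-order parameter count of block mixing with a K x K weight, K = T*D/d_star."""
    N = T * D
    if N % d_star != 0:
        raise ValueError("d_star must divide T * D")
    K = N // d_star
    return K * K

def glu_pffn_params(D, r):
    """(3r + 1) * D^2, the parameter complexity stated for the GLU-improved P-FFN."""
    return (3 * r + 1) * D * D

configs = {"Criteo": (15, 26), "Avazu": (16, 24)}   # (T, D) from Section 4.1
r = 3                                               # expansion ratio from Section 4.1

counts = {
    name: (mixing_params(T, D), glu_pffn_params(D, r))
    for name, (T, D) in configs.items()
}
# counts["Criteo"] == (152100, 6760)
# counts["Avazu"]  == (147456, 5760)
```

Under these formulas the full-mixing weight is roughly 20 to 25 times larger than the $(3r+1)D^{2}$ P-FFN term, which is consistent with the statement that token mixing accounts for the main complexity difference.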

## 4\.Experiments

In this section, we conduct extensive benchmark experiments to answer the following research questions:

- RQ1: Does RankElastor outperform existing baselines on standard CTR prediction benchmarks?
- RQ2: How effectively does RankElastor mitigate embedding collapse compared with RankMixer?
- RQ3: How does RankElastor scale compared with RankMixer?
- RQ4: To what extent does RankElastor generalize beyond CTR prediction tasks?

Table 1. Statistics of benchmark datasets.

### 4\.1\.Setup

##### Datasets\.

We conduct experiments on two industrial\-scale CTR prediction benchmarks, Criteo\(Jean\-Baptiste Tien,[2014](https://arxiv.org/html/2605.23191#bib.bib15)\)and Avazu\(Steve Wang,[2014](https://arxiv.org/html/2605.23191#bib.bib33)\), which are widely used for evaluating recommendation models in practical settings\. Dataset statistics are summarized in Table[1](https://arxiv.org/html/2605.23191#S4.T1)\.

##### Metrics\.

We report AUC and LogLoss, following standard evaluation protocols for CTR prediction\(Guo et al\.,[2024a](https://arxiv.org/html/2605.23191#bib.bib10); Yin et al\.,[2025](https://arxiv.org/html/2605.23191#bib.bib40); Wang et al\.,[2021](https://arxiv.org/html/2605.23191#bib.bib38); Lian et al\.,[2018](https://arxiv.org/html/2605.23191#bib.bib21); Song et al\.,[2019](https://arxiv.org/html/2605.23191#bib.bib31)\)\. To evaluate representation collapse, we report effective rank defined in Eq\.[1](https://arxiv.org/html/2605.23191#S2.E1)\.
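For reference, effective rank can be computed from singular values as sketched below. Eq. 1 is not reproduced in this section, so the sketch assumes the entropy-based definition of Roy and Vetterli (2007), which appears in the reference list; the function name and the `eps` guard are choices made here.

```python
import numpy as np

def effective_rank(X, eps=1e-12):
    """exp of the Shannon entropy of the normalized singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)           # normalize singular values to a distribution
    p = p[p > eps]                    # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

# Two extreme cases: a flat spectrum and a fully collapsed one.
flat = np.eye(4)
collapsed = np.outer(np.arange(1.0, 6.0), np.arange(1.0, 4.0))

erank_flat = effective_rank(flat)            # close to 4
erank_collapsed = effective_rank(collapsed)  # close to 1
```

A per-sample measurement, as used in the distribution plots, would apply this function to each sample's $T\times D$ token matrix at a given stage.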

##### Baselines\.

We compare RankElastor with RankMixer and several representative feature-interaction models that demonstrate strong performance in real-world CTR systems, including xDeepFM (Lian et al., [2018](https://arxiv.org/html/2605.23191#bib.bib21)), DCNv2 (Wang et al., [2021](https://arxiv.org/html/2605.23191#bib.bib38)), AutoInt (Song et al., [2019](https://arxiv.org/html/2605.23191#bib.bib31)), and a standard DNN (MLP). For both RankElastor and RankMixer, we use a two-layer architecture (i.e., two stacked blocks). We set $(T,D)=(15,26)$ on Criteo and $(T,D)=(16,24)$ on Avazu. The hidden dimension of the P-FFNs is set to $D$, and the GLU-improved P-FFNs in RankElastor use an expansion ratio $r=3$. For the remaining baselines, we adopt the recommended configurations provided in the open-source FuxiCTR library (Zhu et al., [2021](https://arxiv.org/html/2605.23191#bib.bib47)).

##### Experimental protocol\.

All models are implemented using the open\-source FuxiCTR benchmark library\(Zhu et al\.,[2021](https://arxiv.org/html/2605.23191#bib.bib47)\), following its officially recommended training pipeline\. We set the embedding dimension to 20 for Criteo and 16 for Avazu\. Models are trained for up to 100 epochs with early stopping triggered when validation loss does not improve for two consecutive epochs\. We use batch sizes of 4096 for Criteo and 10000 for Avazu\. Each reported result is averaged over 10 runs with different random initializations to ensure statistical reliability\.
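The stopping rule described above can be sketched as follows. This is a generic patience-based loop written for illustration; the function name is hypothetical, and the actual FuxiCTR training loop may monitor a different quantity or add steps such as learning-rate reduction.

```python
def early_stopping_epoch(val_losses, max_epochs=100, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return last_epoch, best_epoch

# Made-up validation losses, used only to exercise the rule.
stop_epoch, best_epoch = early_stopping_epoch([0.50, 0.45, 0.44, 0.46, 0.47, 0.40])
```

In this toy trace the loss stops improving after epoch 3, so training halts at epoch 5 and the later value is never reached.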

Table 2. Overall CTR prediction performance of RankElastor and baseline models on Criteo and Avazu.

Table 3. Module-wise ablation study for token mixing and P-FFN in RankElastor and RankMixer.

![Refer to caption](https://arxiv.org/html/2605.23191v1/x12.png)(a) Results on Criteo.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x13.png)(b) Results on Avazu.

Figure 4. Efficiency comparison of RankElastor and RankMixer, alongside other baselines for reference.

![Refer to caption](https://arxiv.org/html/2605.23191v1/x14.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x16.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x17.png)
![Refer to caption](https://arxiv.org/html/2605.23191v1/x18.png)(a) Raw Embedding → Mixing 1
![Refer to caption](https://arxiv.org/html/2605.23191v1/x19.png)(b) Mixing 1 → FFN 1
![Refer to caption](https://arxiv.org/html/2605.23191v1/x20.png)(c) FFN 1 → Mixing 2
![Refer to caption](https://arxiv.org/html/2605.23191v1/x21.png)(d) Mixing 2 → FFN 2

Figure 5. Distributional shift of per-sample effective rank across representation stages in RankElastor, from raw embeddings to successive modules on Criteo (top) and Avazu (bottom). Results correspond to the illustration in Figure [1](https://arxiv.org/html/2605.23191#S2.F1).

### 4\.2\.RQ1: CTR Prediction Performance

#### Quantitative comparisons\.

We first present the quantitative performance comparison between RankElastor and baseline models. As shown in Table [2](https://arxiv.org/html/2605.23191#S4.T2), RankElastor consistently achieves better performance than all baselines, including RankMixer, on both AUC and LogLoss. In particular, RankElastor improves AUC by up to 0.001 over the strongest competing model, which is considered a statistically meaningful improvement on industrial-scale CTR benchmarks (Zhou et al., [2018](https://arxiv.org/html/2605.23191#bib.bib45)). These results validate the effectiveness of the proposed refinements and the resulting RankElastor architecture. Notably, the parameter size of RankElastor remains comparable to other baselines, indicating that the performance gain is achieved without introducing excessive model complexity.

![Refer to caption](https://arxiv.org/html/2605.23191v1/x22.png)\(a\)Comparison on Criteo\.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x23.png)\(b\)Comparison on Avazu\.

Figure 6. Layer-wise comparison of average effective rank, from raw embeddings to module outputs, over test samples for RankElastor and RankMixer. Results align with Figure [2](https://arxiv.org/html/2605.23191#S2.F2).

#### Module contribution analysis\.

We further conduct ablation studies to evaluate the contribution of the proposed modules in RankElastor, together with corresponding ablations applied to RankMixer. As shown in Table [3](https://arxiv.org/html/2605.23191#S4.T3), both Parameterized Full Mixing and the GLU-improved P-FFN contribute significantly to the performance of RankElastor, and removing either component leads to consistent performance degradation. Moreover, the two modules exhibit a clear collaborative effect: combining them yields substantially larger improvements than using either module alone. A similar modification applied to RankMixer (e.g., replacing the FFN with a GLU-style activation) results in only marginal performance gains, suggesting that the benefits of the GLU-improved P-FFN depend on the enhanced token-mixing capability introduced by parameterized full mixing. These observations are consistent with our theoretical analysis, which highlights the complementary roles of the two modules in mitigating embedding collapse.

![Refer to caption](https://arxiv.org/html/2605.23191v1/x24.png)\(a\)Width\-wise dense parameter scaling on Criteo\.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x25.png)\(b\)Depth\-wise dense parameter scaling on Criteo\.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x26.png)\(c\)Width\-wise dense parameter scaling on Avazu\.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x27.png)\(d\)Depth\-wise dense parameter scaling on Avazu\.

Figure 7. Dense parameter scaling trends under width (left) and depth (right) scaling for RankMixer and RankElastor. The x-axis shows the parameter scaling factor relative to the base model. Markers denote empirical measurements, while dashed lines indicate fitted scaling curves capturing the performance–parameter scaling relationship.

#### Efficiency comparison\.

We evaluate the efficiency of RankElastor against RankMixer in terms of training time per epoch and GPU memory usage, with results shown in Figure [4](https://arxiv.org/html/2605.23191#S4.F4). RankElastor introduces a modest 10%–15% increase in runtime compared to RankMixer, while maintaining similar GPU memory requirements and remaining competitive with other efficient baselines such as DCNv2. This aligns with the complexity analysis presented in Section [3.3](https://arxiv.org/html/2605.23191#S3.SS3), indicating that the additional computational cost is practically negligible for large-scale recommendation. Overall, RankElastor achieves a favorable tradeoff, offering superior recommendation performance with only minor efficiency overhead.

### 4\.3\.RQ2: Embedding Collapse Mitigation

We now examine how the effective rank evolves in RankElastor in comparison with RankMixer. Specifically, we analyze both (i) the effective-rank distribution shift and (ii) the average effective rank across representation stages, providing a comprehensive view of representation diversity throughout the models.

#### Effective\-rank distribution shift\.

As shown in Figure [5](https://arxiv.org/html/2605.23191#S4.F5), we visualize per-sample effective-rank distributions from raw embeddings to successive module outputs on Criteo (top) and Avazu (bottom). Compared with the corresponding results of RankMixer in Figure [1](https://arxiv.org/html/2605.23191#S2.F1), RankElastor exhibits substantially larger effective-rank distribution shifts across internal representations. While both models start with nearly identical effective-rank distributions at the raw embedding stage, their behaviors diverge as representations propagate through the network. In particular, the parameterized full mixing in RankElastor produces noticeably larger effective-rank gains than the block-transposed mixing in RankMixer, and the GLU-improved P-FFN maintains stronger rank preservation compared with the P-FFN used in RankMixer.

#### “Expand more, shrink less”\.

Figure [6](https://arxiv.org/html/2605.23191#S4.F6) shows the layer-wise average effective rank for both models, corresponding to Figure [2](https://arxiv.org/html/2605.23191#S2.F2). We observe that the spectral dynamics of RankElastor follow the same characteristic alternating behavior previously identified in RankMixer: token mixing layers tend to expand the representation spectrum and increase effective rank, whereas P-FFN layers introduce rank contraction. However, compared with RankMixer, RankElastor consistently exhibits noticeably stronger expansion effects together with substantially milder contraction after each P-FFN stage. This behavior leads to more stable spectral evolution across layers, allowing RankElastor to consistently maintain higher average effective rank than RankMixer across datasets and representation stages. Notably, on Avazu, where RankMixer exhibits clear collapse tendencies, RankElastor preserves steady effective-rank growth throughout the network, indicating improved robustness against representation collapse.

Overall, these results confirm that RankElastor mitigates collapse more effectively than RankMixer , producing more stable and expressive internal representations across layers\.

### 4\.4\.RQ3: Dense Parameter Scaling Analysis

To evaluate the scaling behavior of RankElastor as a scalable deep recommender paradigm, we conduct scaling-law experiments on its dense parameters, including those in Parameterized Full Mixing and GLU-improved P-FFNs. We vary (i) depth (number of blocks) and (ii) width (hidden size of FFNs), and examine both individual and joint scaling schemes, with comparisons to RankMixer.

Figure [7](https://arxiv.org/html/2605.23191#S4.F7) shows the results of width-wise (left) and depth-wise (right) scaling, while Figure [8](https://arxiv.org/html/2605.23191#S4.F8) presents the joint depth and width scaling results. Across all configurations, RankElastor exhibits substantially better scaling behavior than RankMixer, with larger AUC improvements and lower LogLoss as model parameters increase. These results indicate that RankElastor can serve as a more effective paradigm for deep recommender systems, benefiting from its theoretically grounded design that mitigates embedding collapse while enhancing downstream performance.

Furthermore, comparing joint scaling (Figure [8](https://arxiv.org/html/2605.23191#S4.F8)) with individual depth- or width-scaling (Figure [7](https://arxiv.org/html/2605.23191#S4.F7)), we observe that increasing both depth and width consistently yields larger performance gains than scaling a single dimension. This behavior aligns with findings in language models (Kaplan et al., [2020](https://arxiv.org/html/2605.23191#bib.bib18)) and prior observations in RankMixer (Zhu et al., [2025](https://arxiv.org/html/2605.23191#bib.bib46)), indicating that RankElastor follows the same scaling trend with more pronounced gains.
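A scaling curve of the kind shown as dashed lines in the figures can be fitted with a log-log linear regression. The sketch below assumes a power-law form, $\text{metric}\approx a\cdot s^{b}$ in the parameter scaling factor $s$; the paper does not state the functional form in this section, so this is an assumption. The data points are synthetic and generated from a known curve purely to check the fitting routine. They are not results from the paper.

```python
import numpy as np

def fit_power_law(scale, metric):
    """Fit metric ~ a * scale**b by least squares in log-log space."""
    b, log_a = np.polyfit(np.log(scale), np.log(metric), deg=1)
    return float(np.exp(log_a)), float(b)

# Synthetic illustration only: points drawn from a known power law.
scale = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
synthetic_loss = 0.45 * scale ** (-0.01)

a_hat, b_hat = fit_power_law(scale, synthetic_loss)   # recovers a = 0.45, b = -0.01
```

With real measurements, a more negative fitted exponent for LogLoss would indicate that the model benefits more from each multiplicative increase in dense parameters.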

![Refer to caption](https://arxiv.org/html/2605.23191v1/x28.png)\(a\)Joint width and depth dense parameter scaling on Criteo\.
![Refer to caption](https://arxiv.org/html/2605.23191v1/x29.png)\(b\)Joint depth and width dense parameter scaling on Avazu\.

Figure 8. Dense parameter scaling trends under joint width and depth scaling for RankMixer and RankElastor.

### 4\.5\.RQ4: Behavior Sequence Modeling Generalization

To further evaluate the generalization capability of RankElastor beyond standard CTR prediction, we additionally assess our method on behavior sequence modeling tasks. Specifically, we conduct experiments on two datasets from the FuxiCTR benchmark: KuaiVideo (K), which predicts users' click probabilities on short videos, and TaobaoAd (T), which models user shopping behaviors. Following the benchmark protocol, we report both AUC and Group-AUC (gAUC), and compare RankElastor against several strong baselines, including RankMixer.
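Group-AUC is commonly defined as the impression-weighted average of per-user AUC, skipping users whose samples contain only one class. The sketch below follows that common definition as an assumption; the function names are hypothetical, and the FuxiCTR implementation may differ in details such as tie handling or weighting.

```python
import numpy as np

def auc(labels, scores):
    """Pairwise AUC; ties count as 0.5. Returns None if only one class is present."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        return None
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def group_auc(groups, labels, scores):
    """Average of per-group AUC, weighted by the number of samples in each group."""
    groups = np.asarray(groups)
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    total, weight = 0.0, 0
    for g in np.unique(groups):
        mask = groups == g
        value = auc(labels[mask], scores[mask])
        if value is None:          # user with only clicks or only non-clicks
            continue
        n = int(mask.sum())
        total += value * n
        weight += n
    return total / weight if weight else None

# Toy example with three users; user 3 has only positives and is skipped.
users = [1, 1, 1, 2, 2, 3, 3]
clicks = [1, 0, 0, 1, 0, 1, 1]
preds = [0.9, 0.2, 0.95, 0.3, 0.6, 0.5, 0.4]

global_auc = auc(clicks, preds)
gauc = group_auc(users, clicks, preds)
```

The two metrics can disagree, as in this toy example, because gAUC only compares items shown to the same user and so measures within-user ranking quality.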

As shown in Table [4](https://arxiv.org/html/2605.23191#S4.T4), RankElastor consistently achieves the best performance across both datasets and evaluation metrics, demonstrating clear advantages over all competing methods. Moreover, we observe similar spectral dynamics improvements to those shown in Figure [6](https://arxiv.org/html/2605.23191#S4.F6), suggesting that both our analytical framework and the proposed unified mixing design generalize effectively beyond standard CTR estimation tasks. These results highlight the potential of our approach for broader recommendation scenarios.

Table 4\.Comparison between RankElastor and RankMixer on behavior sequence modeling tasks\. \(K\) and \(T\) denote results on KuaiVideo and TaobaoAd datasets, respectively\.

## 5\.Conclusion

#### Paper summary\.

In this paper, we analyzed RankMixer from the perspective of embedding collapse and showed that, although it provides modest improvements in effective rank compared to conventional recommenders, these gains are typically insufficient to prevent collapse, highlighting limitations in its core token-mixing and per-token FFN modules. Motivated by these insights, we proposed RankElastor, a novel deep recommender that enhances RankMixer with (i) Parameterized Full Mixing and (ii) GLU-improved per-token FFNs, which provably enable more expressive representation transformations and produce spectrum-robust token embeddings. Extensive experiments on industrial-scale benchmarks demonstrate that RankElastor consistently achieves superior performance over strong baselines, including RankMixer, while maintaining higher effective rank and more expressive, non-collapsed representations. Moreover, scaling-law analyses confirm that RankElastor exhibits better scaling behavior, highlighting its potential as a robust and scalable paradigm for large-scale recommendation systems.

#### Limitations and future work\.

The effective-rank dynamics identified in this work, particularly the damped oscillatory trajectory observed in token-transformation-based recommenders such as RankMixer and RankElastor, are inherently architecture-dependent, and our theoretical analysis focuses on this class of models. Nevertheless, the strong empirical performance of RankMixer-style architectures suggests substantial room for further exploration. Our study provides a spectral perspective on representation collapse in deep recommenders and points to promising directions for future work, including more expressive token-mixing mechanisms, spectrum-preserving nonlinear modules, and deeper investigation of spectral dynamics in this architecture.

## Acknowledgments

This work was supported by the Tencent Rhino\-Bird Focused Research Project under Grant No\. P00660\. We thank the anonymous reviewers for their valuable feedback and constructive suggestions\. We also appreciate the open\-source community and benchmark maintainers whose resources supported this research\.

## References

- Achiam et al\.\(2023\)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al\.2023\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*\(2023\)\.
- Balduzzi et al\.\(2017\)David Balduzzi, Marcus Frean, Lennox Leary, J\. P\. Lewis, Kurt Wan\-Duo Ma, and Brian McWilliams\. 2017\.The Shattered Gradients Problem: If resnets are the answer, then what is the question?\. In*Proceedings of the 34th International Conference on Machine Learning**\(Proceedings of Machine Learning Research, Vol\. 70\)*, Doina Precup and Yee Whye Teh \(Eds\.\)\. PMLR, 342–350\.[https://proceedings\.mlr\.press/v70/balduzzi17b\.html](https://proceedings.mlr.press/v70/balduzzi17b.html)
- Bartlett et al\.\(2020\)Peter L\. Bartlett, Philip M\. Long, Gábor Lugosi, and Alexander Tsigler\. 2020\.Benign overfitting in linear regression\.*Proceedings of the National Academy of Sciences*117, 48 \(2020\), 30063–30070\.[doi:10\.1073/pnas\.1907378117](https://doi.org/10.1073/pnas.1907378117)
- Batory et al\.\(2011\)Don Batory, Peter Höfner, and Jongwook Kim\. 2011\.Feature interactions, products, and composition\. In*Proceedings of the 10th ACM International Conference on Generative Programming and Component Engineering*\(Portland, Oregon, USA\)*\(GPCE ’11\)*\. Association for Computing Machinery, New York, NY, USA, 13–22\.[doi:10\.1145/2047862\.2047867](https://doi.org/10.1145/2047862.2047867)
- Chen et al\.\(2024\)Huiyuan Chen, Vivian Lai, Hongye Jin, Zhimeng Jiang, Mahashweta Das, and Xia Hu\. 2024\.Towards Mitigating Dimensional Collapse of Representations in Collaborative Filtering\. In*Proceedings of the 17th ACM International Conference on Web Search and Data Mining*\(Merida, Mexico\)*\(WSDM ’24\)*\. Association for Computing Machinery, New York, NY, USA, 106–115\.[doi:10\.1145/3616855\.3635832](https://doi.org/10.1145/3616855.3635832)
- Cohen et al\.\(2016\)Michael B\. Cohen, Jelani Nelson, and David P\. Woodruff\. 2016\.Optimal approximate matrix product in terms of stable rank\.arXiv:1507\.02268 \[cs\.DS\][https://arxiv\.org/abs/1507\.02268](https://arxiv.org/abs/1507.02268)
- Fishman et al\.\(2025\)Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry\. 2025\.Scaling FP8 training to trillion\-token LLMs\.arXiv:2409\.12517 \[cs\.LG\][https://arxiv\.org/abs/2409\.12517](https://arxiv.org/abs/2409.12517)
- Guo et al\.\(2024b\)Wei Guo, Hao Wang, Luankang Zhang, Jin Yao Chin, Zhongzhou Liu, Kai Cheng, Qiushi Pan, Yi Quan Lee, Wanqi Xue, Tingjia Shen, Kenan Song, Kefan Wang, Wenjia Xie, Yuyang Ye, Huifeng Guo, Yong Liu, Defu Lian, Ruiming Tang, and Enhong Chen\. 2024b\.Scaling New Frontiers: Insights into Large Recommendation Models\.arXiv:2412\.00714 \[cs\.IR\][https://arxiv\.org/abs/2412\.00714](https://arxiv.org/abs/2412.00714)
- Guo et al\.\(2024a\)Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long\. 2024a\.On the Embedding Collapse when Scaling up Recommendation Models\. In*Forty\-first International Conference on Machine Learning*\.[https://openreview\.net/forum?id=aPVwOAr1aW](https://openreview.net/forum?id=aPVwOAr1aW)
- Henderson and Searle \(1981\)Harold V\. Henderson and S\. R\. Searle\. 1981\.The vec\-permutation matrix, the vec operator and Kronecker products: a review\.*Linear and Multilinear Algebra*9, 4 \(1981\), 271–288\.[doi:10\.1080/03081088108817379](https://doi.org/10.1080/03081088108817379)
- Hendrycks and Gimpel \(2023\)Dan Hendrycks and Kevin Gimpel\. 2023\.Gaussian Error Linear Units \(GELUs\)\.arXiv:1606\.08415 \[cs\.LG\][https://arxiv\.org/abs/1606\.08415](https://arxiv.org/abs/1606.08415)
- Horn and Johnson \(2012\)Roger A Horn and Charles R Johnson\. 2012\.*Matrix analysis*\.Cambridge university press\.
- Hua et al\.\(2021\)Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao\. 2021\.On Feature Decorrelation in Self\-Supervised Learning\. In*Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\)*\. 9598–9608\.
- Jean\-Baptiste Tien \(2014\)Olivier Chapelle Jean\-Baptiste Tien, joycenv\. 2014\.Display Advertising Challenge\.[https://kaggle\.com/competitions/criteo\-display\-ad\-challenge](https://kaggle.com/competitions/criteo-display-ad-challenge)
- Jing et al\.\(2022\)Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian\. 2022\.Understanding Dimensional Collapse in Contrastive Self\-supervised Learning\. In*International Conference on Learning Representations*\.[https://openreview\.net/forum?id=YevsQ05DEN7](https://openreview.net/forum?id=YevsQ05DEN7)
- Kang et al\.\(2026\)Yu Kang, Junwei Pan, Jipeng Jin, Shudong Huang, Xiaofeng Gao, and Lei Xiao\. 2026\.Towards Unifying Feature Interaction Models for Click\-Through Rate Prediction\. In*Machine Learning and Knowledge Discovery in Databases\. Research Track*, Rita P\. Ribeiro, Bernhard Pfahringer, Nathalie Japkowicz, Pedro Larrañaga, Alípio M\. Jorge, Carlos Soares, Pedro H\. Abreu, and João Gama \(Eds\.\)\. Springer Nature Switzerland, Cham, 451–467\.
- Kaplan et al\.\(2020\)Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei\. 2020\.Scaling Laws for Neural Language Models\.arXiv:2001\.08361 \[cs\.LG\][https://arxiv\.org/abs/2001\.08361](https://arxiv.org/abs/2001.08361)
- Kira and Rendell \(1992\)Kenji Kira and Larry A\. Rendell\. 1992\.The feature selection problem: traditional methods and a new algorithm\. In*Proceedings of the Tenth National Conference on Artificial Intelligence*\(San Jose, California\)*\(AAAI’92\)*\. AAAI Press, 129–134\.
- Kirillov et al\.\(2023\)Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C\. Berg, Wan\-Yen Lo, Piotr Dollar, and Ross Girshick\. 2023\.Segment Anything\. In*Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\)*\. 4015–4026\.
- Lian et al\.\(2018\)Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun\. 2018\.xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems\. In*Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*\(London, United Kingdom\)*\(KDD ’18\)*\. Association for Computing Machinery, New York, NY, USA, 1754–1763\.[doi:10\.1145/3219819\.3220023](https://doi.org/10.1145/3219819.3220023)
- Lin et al\.\(2025\)Zhutian Lin, Junwei Pan, Haibin Yu, Xi Xiao, Ximei Wang, Zhixiang Feng, Shifeng Wen, Shudong Huang, Dapeng Liu, and Lei Xiao\. 2025\.Crocodile: Cross Experts Covariance for Disentangled Learning in Multi\-Domain Recommendation\. In*Proceedings of the 34th ACM International Conference on Information and Knowledge Management*\(Seoul, Republic of Korea\)*\(CIKM ’25\)*\. Association for Computing Machinery, New York, NY, USA, 1839–1849\.[doi:10\.1145/3746252\.3761332](https://doi.org/10.1145/3746252.3761332)
- Magnus and Neudecker \(1979\)Jan R\. Magnus and H\. Neudecker\. 1979\.The Commutation Matrix: Some Properties and Applications\.*The Annals of Statistics*7, 2 \(1979\), 381 – 394\.[doi:10\.1214/aos/1176344621](https://doi.org/10.1214/aos/1176344621)
- McMahan et al\.\(2013\)H\. Brendan McMahan, Gary Holt, D\. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica\. 2013\.Ad click prediction: a view from the trenches\. In*Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*\(Chicago, Illinois, USA\)*\(KDD ’13\)*\. Association for Computing Machinery, New York, NY, USA, 1222–1230\.[doi:10\.1145/2487575\.2488200](https://doi.org/10.1145/2487575.2488200)
- Pan et al\.\(2024\)Junwei Pan, Wei Xue, Ximei Wang, Haibin Yu, Xun Liu, Shijie Quan, Xueming Qiu, Dapeng Liu, Lei Xiao, and Jie Jiang\. 2024\.Ads Recommendation in a Collapsed and Entangled World\. In*Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*\(Barcelona, Spain\)*\(KDD ’24\)*\. Association for Computing Machinery, New York, NY, USA, 5566–5577\.[doi:10\.1145/3637528\.3671607](https://doi.org/10.1145/3637528.3671607)
- Radford et al\.\(2021\)Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever\. 2021\.Learning Transferable Visual Models From Natural Language Supervision\. In*Proceedings of the 38th International Conference on Machine Learning**\(Proceedings of Machine Learning Research, Vol\. 139\)*, Marina Meila and Tong Zhang \(Eds\.\)\. PMLR, 8748–8763\.[https://proceedings\.mlr\.press/v139/radford21a\.html](https://proceedings.mlr.press/v139/radford21a.html)
- Rendle \(2010\)Steffen Rendle\. 2010\.Factorization Machines\. In*2010 IEEE International Conference on Data Mining*\. 995–1000\.[doi:10\.1109/ICDM\.2010\.127](https://doi.org/10.1109/ICDM.2010.127)
- Rombach et al\.\(2022\)Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer\. 2022\.High\-Resolution Image Synthesis With Latent Diffusion Models\. In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\. 10684–10695\.
- Roy and Vetterli \(2007\)Olivier Roy and Martin Vetterli\. 2007\.The effective rank: A measure of effective dimensionality\. In*2007 15th European Signal Processing Conference*\. 606–610\.
- Shazeer \(2020\)Noam Shazeer\. 2020\.GLU Variants Improve Transformer\.arXiv:2002\.05202 \[cs\.LG\][https://arxiv\.org/abs/2002\.05202](https://arxiv.org/abs/2002.05202)
- Song et al\.\(2019\)Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang\. 2019\.AutoInt: Automatic Feature Interaction Learning via Self\-Attentive Neural Networks\. In*Proceedings of the 28th ACM International Conference on Information and Knowledge Management*\(Beijing, China\)*\(CIKM ’19\)*\. Association for Computing Machinery, New York, NY, USA, 1161–1170\.[doi:10\.1145/3357384\.3357925](https://doi.org/10.1145/3357384.3357925)
- Song et al\.\(2024\)Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen\. 2024\.Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters\.arXiv:2406\.05955 \[cs\.LG\][https://arxiv\.org/abs/2406\.05955](https://arxiv.org/abs/2406.05955)
- Steve Wang \(2014\)Will Cukierski Steve Wang\. 2014\.Click\-Through Rate Prediction\.[https://kaggle\.com/competitions/avazu\-ctr\-prediction](https://kaggle.com/competitions/avazu-ctr-prediction)
- Strang \(2022\)Gilbert Strang\. 2022\.*Introduction to linear algebra*\.SIAM\.
- Vershynin \(2011\)Roman Vershynin\. 2011\.Introduction to the non\-asymptotic analysis of random matrices\.arXiv:1011\.3027 \[math\.PR\][https://arxiv\.org/abs/1011\.3027](https://arxiv.org/abs/1011.3027)
- Wang et al\.\(2025a\)Chunqi Wang, Bingchao Wu, Zheng Chen, Lei Shen, Bing Wang, and Xiaoyi Zeng\. 2025a\.Scaling Transformers for Discriminative Recommendation via Generative Pretraining\. In*Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2*\(Toronto ON, Canada\)*\(KDD ’25\)*\. Association for Computing Machinery, New York, NY, USA, 2893–2903\.[doi:10\.1145/3711896\.3737117](https://doi.org/10.1145/3711896.3737117)
- Wang et al\.\(2025b\)Jiancheng Wang, Mingjia Yin, Hao Wang, and Enhong Chen\. 2025b\.Enhancing CTR Prediction with De\-correlated Expert Networks\.arXiv:2505\.17925 \[cs\.IR\][https://arxiv\.org/abs/2505\.17925](https://arxiv.org/abs/2505.17925)
- Wang et al\.\(2021\)Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi\. 2021\.DCN V2: Improved Deep & Cross Network and Practical Lessons for Web\-scale Learning to Rank Systems\. In*Proceedings of the Web Conference 2021*\(Ljubljana, Slovenia\)*\(WWW ’21\)*\. Association for Computing Machinery, New York, NY, USA, 1785–1797\.[doi:10\.1145/3442381\.3450078](https://doi.org/10.1145/3442381.3450078)
- Yang et al\.\(2024\)Jaewoo Yang, Hayun Kim, and Younghoon Kim\. 2024\.Mitigating Quantization Errors Due to Activation Spikes in GLU\-Based LLMs\.arXiv:2405\.14428 \[cs\.CL\][https://arxiv\.org/abs/2405\.14428](https://arxiv.org/abs/2405.14428)
- Yin et al\.\(2025\)Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang, Jie Jiang, Defu Lian, and Enhong Chen\. 2025\.From Feature Interaction to Feature Generation: A Generative Paradigm of CTR Prediction Models\. In*Forty\-second International Conference on Machine Learning*\.[https://openreview\.net/forum?id=DatAXrGzlc](https://openreview.net/forum?id=DatAXrGzlc)
- Zhang et al\.\(2024b\)Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Shen Li, Yanli Zhao, Yuchen Hao, Yantao Yao, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, and Wenlin Chen\. 2024b\.Wukong: Towards a Scaling Law for Large\-Scale Recommendation\. In*Forty\-first International Conference on Machine Learning*\.[https://openreview\.net/forum?id=8iUgr2nuwo](https://openreview.net/forum?id=8iUgr2nuwo)
- Zhang et al\.\(2024a\)Gaowei Zhang, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji\-Rong Wen\. 2024a\.Scaling Law of Large Sequential Recommendation Models\. In*Proceedings of the 18th ACM Conference on Recommender Systems*\(Bari, Italy\)*\(RecSys ’24\)*\. Association for Computing Machinery, New York, NY, USA, 444–453\.[doi:10\.1145/3640457\.3688129](https://doi.org/10.1145/3640457.3688129)
- Zhang et al\.\(2016\)Weinan Zhang, Tianming Du, and Jun Wang\. 2016\.Deep Learning over Multi\-field Categorical Data\. In*Advances in Information Retrieval*, Nicola Ferro, Fabio Crestani, Marie\-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello \(Eds\.\)\. Springer International Publishing, Cham, 45–57\.
- Zhang et al\.\(2025\)Yabin Zhang, Jiakai Tang, and Xu Chen\. 2025\.Alleviating Dimensional Collapse Problem in Deep Recommender Models by Designing Uniformity Layers\. In*Database Systems for Advanced Applications*, Makoto Onizuka, Jae\-Gil Lee, Yongxin Tong, Chuan Xiao, Yoshiharu Ishikawa, Sihem Amer\-Yahia, H\. V\. Jagadish, and Kejing Lu \(Eds\.\)\. Springer Nature Singapore, Singapore, 148–163\.
- Zhou et al\.\(2018\)Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai\. 2018\.Deep Interest Network for Click\-Through Rate Prediction\. In*Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*\(London, United Kingdom\)*\(KDD ’18\)*\. Association for Computing Machinery, New York, NY, USA, 1059–1068\.[doi:10\.1145/3219819\.3219823](https://doi.org/10.1145/3219819.3219823)
- Zhu et al\.\(2025\)Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu\. 2025\.RankMixer: Scaling Up Ranking Models in Industrial Recommenders\. In*Proceedings of the 34th ACM International Conference on Information and Knowledge Management*\(Seoul, Republic of Korea\)*\(CIKM ’25\)*\. Association for Computing Machinery, New York, NY, USA, 6309–6316\.[doi:10\.1145/3746252\.3761507](https://doi.org/10.1145/3746252.3761507)
- Zhu et al\.\(2021\)Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He\. 2021\.Open Benchmarking for Click\-Through Rate Prediction\. In*Proceedings of the 30th ACM International Conference on Information & Knowledge Management*\(Virtual Event, Queensland, Australia\)*\(CIKM ’21\)*\. Association for Computing Machinery, New York, NY, USA, 2759–2769\.[doi:10\.1145/3459637\.3482486](https://doi.org/10.1145/3459637.3482486)

## Appendix A. Justification of Theorem [2.1](https://arxiv.org/html/2605.23191#S2.Thmtheorem1)

In this section we justify Theorem [2.1](https://arxiv.org/html/2605.23191#S2.Thmtheorem1). For completeness, we restate the setup and result before presenting the derivations.

#### Restated theorem\.

Let $X\in\mathbb{R}^{T\times D}$ with $D=Td$ and $d\geq 1$. Partition $X$ into a $T\times T$ grid of row-blocks $X_{ij}\in\mathbb{R}^{1\times d}$. Define the block-transpose operator $\mathcal{T}$ by

$$(\mathcal{T}(X))_{ij}=X_{ji},$$

and let $Y=\mathcal{T}(X)$. Denote

$$\operatorname{rank}(X)=r,\qquad\operatorname{erank}(X)=k,\qquad\mu=\operatorname{erank}(Y).$$

Assume Frobenius-orthogonality and spectral incoherence:

$$\langle X,Y\rangle_{F}\approx 0,\qquad\|X+Y\|_{2}^{2}\approx\max\{\|X\|_{2}^{2},\|Y\|_{2}^{2}\}.$$

Let $M=X+Y$. The theorem states

$$\frac{2k\mu}{(\sqrt{k}+\sqrt{\mu})^{2}}\leq\operatorname{erank}(M)\leq 2(k+\mu),\qquad\mu\leq\operatorname{rank}(Y)\leq\min\{T,rd\}.$$

We prove the result in two steps: first bounding the algebraic rank of $Y=\mathcal{T}(X)$, and then analyzing the effective rank of $M=X+\mathcal{T}(X)$.

### A.1. Rank properties of the block-transpose operator

We begin with a rank characterization of the block\-transpose mapping\.

#### Setup\.

Write a rank-$r$ decomposition

$$X=\sum_{\ell=1}^{r}\sigma_{\ell}\,\mathbf{u}_{\ell}\mathbf{v}_{\ell}^{\top},$$

where $\mathbf{u}_{\ell}\in\mathbb{R}^{T}$ and $\mathbf{v}_{\ell}\in\mathbb{R}^{Td}$. Partition each $\mathbf{v}_{\ell}$ into $T$ contiguous segments $\mathbf{v}_{\ell}^{(j)}\in\mathbb{R}^{1\times d}$.

#### Rank\-1 case\.

Let $X=\mathbf{u}\mathbf{v}^{\top}$. Then $X_{ij}=u_{i}\,\mathbf{v}^{(j)}$ and

$$Y_{ij}=X_{ji}=u_{j}\,\mathbf{v}^{(i)}.$$

For the $j$-th block-column of $Y$,

$$C_{j}=\begin{bmatrix}Y_{1j}\\ \vdots\\ Y_{Tj}\end{bmatrix}=u_{j}\begin{bmatrix}\mathbf{v}^{(1)}\\ \vdots\\ \mathbf{v}^{(T)}\end{bmatrix}.$$

Define

$$V_{\mathrm{stack}}=\begin{bmatrix}\mathbf{v}^{(1)}\\ \vdots\\ \mathbf{v}^{(T)}\end{bmatrix}\in\mathbb{R}^{T\times d}.$$

Then $C_{j}=u_{j}V_{\mathrm{stack}}$, so all block-columns of $Y$ lie in the column space of $V_{\mathrm{stack}}$. Hence

$$\operatorname{rank}(Y)=\operatorname{rank}(V_{\mathrm{stack}})\leq\min\{T,d\}.$$

#### General rank-$r$ case.

By linearity of $\mathcal{T}$,

$$Y=\sum_{\ell=1}^{r}\sigma_{\ell}\,\mathcal{T}(\mathbf{u}_{\ell}\mathbf{v}_{\ell}^{\top}).$$

Each term contributes at most $d$ independent directions in $\mathbb{R}^{T}$, so the column space of $Y$ is contained in a subspace of dimension at most $rd$. Since $Y\in\mathbb{R}^{T\times Td}$, this yields

$$\operatorname{rank}(Y)\leq\min\{T,rd\}.$$
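The rank bound can be checked numerically. The following is a minimal NumPy sketch rather than the authors' implementation: the helpers `block_transpose` and `numerical_rank`, the sizes `T`, `d`, `r`, and the row-major block layout are all illustrative assumptions.

```python
import numpy as np

def block_transpose(X, d):
    """Apply (T(X))_{ij} = X_{ji} to a T x (T*d) matrix made of 1 x d blocks."""
    T, D = X.shape
    if D != T * d:
        raise ValueError("expected D = T * d")
    blocks = X.reshape(T, T, d)                    # blocks[i, j] is the block X_ij
    return blocks.transpose(1, 0, 2).reshape(T, D)

def numerical_rank(M, rtol=1e-9):
    """Count singular values above rtol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > rtol * s[0]))

rng = np.random.default_rng(0)
T, d, r = 8, 2, 2                                  # hypothetical sizes with r * d < T
X = rng.standard_normal((T, r)) @ rng.standard_normal((r, T * d))  # algebraic rank r
Y = block_transpose(X, d)

rank_X, rank_Y = numerical_rank(X), numerical_rank(Y)
print(f"rank(X) = {rank_X}, rank(Y) = {rank_Y}, bound min(T, r*d) = {min(T, r * d)}")
```

Because the operator only permutes blocks, applying it twice returns the original matrix, which gives a cheap sanity check on the indexing.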

### A.2. Effective rank of the initialized sum

We now analyze the effective rank of

$$M=X+\mathcal{T}(X)=X+Y.$$

#### Spectral incoherence assumption.

We assume the initialization produces approximate Frobenius orthogonality, $\langle X,Y\rangle_{F}\approx 0$, and incoherent dominant singular directions,

$$\|X+Y\|_{2}^{2}\approx\max\{\|X\|_{2}^{2},\|Y\|_{2}^{2}\}.$$

Since $\mathcal{T}$ permutes entries,

$$\|Y\|_{F}=\|X\|_{F}.$$

#### Lower bound\.

Frobenius orthogonality implies

$$\|M\|_{F}^{2}\approx\|X\|_{F}^{2}+\|Y\|_{F}^{2}\approx 2\|X\|_{F}^{2}.$$

Using the triangle inequality $\|M\|_{2}\leq\|X\|_{2}+\|Y\|_{2}$ together with the identities (which follow from $\operatorname{erank}(\cdot)=\|\cdot\|_{F}^{2}/\|\cdot\|_{2}^{2}$ and $\|Y\|_{F}=\|X\|_{F}$)

$$\|X\|_{2}=\frac{\|X\|_{F}}{\sqrt{k}},\qquad\|Y\|_{2}=\frac{\|X\|_{F}}{\sqrt{\mu}},$$

we obtain

$$\operatorname{erank}(M)=\frac{\|M\|_{F}^{2}}{\|M\|_{2}^{2}}\geq\frac{2\|X\|_{F}^{2}}{\left(\frac{\|X\|_{F}}{\sqrt{k}}+\frac{\|X\|_{F}}{\sqrt{\mu}}\right)^{2}}=\frac{2k\mu}{(\sqrt{k}+\sqrt{\mu})^{2}}.$$

#### Upper bound\.

Under spectral incoherence,

$$\|M\|_{2}^{2}\approx\max\{\|X\|_{2}^{2},\|Y\|_{2}^{2}\},\quad\|M\|_{F}^{2}\approx\|X\|_{F}^{2}+\|Y\|_{F}^{2}.$$

Therefore,

$$\operatorname{erank}(M)\approx\frac{\|X\|_{F}^{2}+\|Y\|_{F}^{2}}{\max\{\|X\|_{2}^{2},\|Y\|_{2}^{2}\}}\leq\frac{\|X\|_{F}^{2}}{\|X\|_{2}^{2}}+\frac{\|Y\|_{F}^{2}}{\|Y\|_{2}^{2}}=k+\mu.$$

Allowing constant slack from the approximation yields

$$\operatorname{erank}(M)\leq 2(k+\mu).$$

Combining the algebraic-rank bound for $Y$ with the effective-rank bounds above completes the justification of Theorem [2.1](https://arxiv.org/html/2605.23191#S2.Thmtheorem1).
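To see how the quantities in these bounds behave on a concrete draw, the sketch below computes $k$, $\mu$, $\operatorname{erank}(M)$ and the two bounds for a random matrix. It is only an illustration: the Gaussian stand-in for $X$, the sizes, and the helper names are assumptions, and since the bounds rest on the approximate incoherence conditions, the script reports the values (including the normalized overlap $\langle X,Y\rangle_F/\|X\|_F^2$) without asserting that the bounds hold.

```python
import numpy as np

def erank(M):
    """Effective rank in the form used above: ||M||_F^2 / ||M||_2^2."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)

def block_transpose(X, d):
    """(T(X))_{ij} = X_{ji} for a T x (T*d) matrix of 1 x d blocks."""
    T = X.shape[0]
    return X.reshape(T, T, d).transpose(1, 0, 2).reshape(T, T * d)

rng = np.random.default_rng(0)
T, d = 16, 4                                   # hypothetical sizes
X = rng.standard_normal((T, T * d))            # stand-in for an initialized representation
Y = block_transpose(X, d)
M = X + Y

k, mu, e_M = erank(X), erank(Y), erank(M)
lower = 2 * k * mu / (np.sqrt(k) + np.sqrt(mu)) ** 2
upper = 2 * (k + mu)
# Normalized Frobenius overlap: close to 0 when Frobenius-orthogonality roughly holds.
overlap = float(np.sum(X * Y) / np.sum(X * X))
print(f"k={k:.2f} mu={mu:.2f} erank(M)={e_M:.2f} "
      f"lower={lower:.2f} upper={upper:.2f} overlap={overlap:.3f}")
```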

## Appendix B. Justification of Theorem [2.2](https://arxiv.org/html/2605.23191#S2.Thmtheorem2)

The goal of this section is to give a compact, self-contained justification of the two claims that comprise Theorem [2.2](https://arxiv.org/html/2605.23191#S2.Thmtheorem2). Informally, the first claim is an *algebraic* limitation that follows from positive homogeneity of the activation; the second claim is a *probabilistic* contraction of effective rank that occurs under a measurable response gap between principal and tail directions. Below we state the result and the required probabilistic assumption, and then present the two proofs as subsections. Each proof is explicit and includes the necessary concentration arguments.

#### Restated theorem\.

Let $X\in\mathbb{R}^{T\times D}$ satisfy $\operatorname{erank}(X)=k$ with $\tfrac{k}{D}\leq\gamma$ for some fixed $\gamma\in(0,1)$. Consider the per-row map $\mathcal{F}(X)=\phi(XA)B$ where $A\in\mathbb{R}^{D\times m}$ and $B\in\mathbb{R}^{m\times D}$ have i.i.d. sub-Gaussian entries with variance $1/D$, and $\phi$ is positively homogeneous of degree one. The two claims below are proved in full in the subsequent subsections:

- Deterministic homogeneity barrier: when $X$ has algebraic rank $1$, the output rank is at most $2$ (and equals $1$ under a single-sign condition).
- Probabilistic contraction: under the Spectral-concentration & Response-Gap condition and standard concentration assumptions on pre-activations and $\phi$, the effective rank is contracted by a constant factor $\alpha\in(0,1)$ with high probability.

#### Assumption \(Spectral concentration & Response\-Gap\)\.

Let $H=XA$ and $\Sigma_{H}=\tfrac{1}{T}H^{\top}H$ with eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\cdots$. Fix an integer $r\geq 1$ and a constant $\eta\in(0,1]$ such that the top-$r$ eigenvalues capture an $\eta$-fraction of total energy:

$$\sum_{i=1}^{r}\lambda_{i}\geq\eta\,\operatorname{tr}(\Sigma_{H}).$$

For a unit direction $u$ define the directional response ratio

$$\rho(u)\;=\;\frac{\mathbb{E}[\phi(\langle u,h\rangle)^{2}]}{\mathbb{E}[\langle u,h\rangle^{2}]}.$$

Assume there exist disjoint unit-direction sets $\mathcal{U}_{\mathrm{top}}$ and $\mathcal{U}_{\mathrm{low}}$ and constants $0\leq c_{\mathrm{low}}<c_{\mathrm{top}}$ such that

$$\inf_{u\in\mathcal{U}_{\mathrm{top}}}\rho(u)\geq c_{\mathrm{top}},\qquad\sup_{u\in\mathcal{U}_{\mathrm{low}}}\rho(u)\leq c_{\mathrm{low}}.$$

This captures the empirical situation where the activation attenuates low-variance (tail) directions more than principal directions.
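As a rough illustration of the response ratio, the sketch below estimates $\rho(u)$ from samples of the projection $\langle u,h\rangle$ with $\phi=\mathrm{ReLU}$. Everything about the sampling is an assumption made for this toy example: the projections are modeled as unit-variance Gaussians, and the mean shifts of $\pm 1.5$ are hypothetical stand-ins for directions along which pre-activations are mostly positive or mostly negative. For a zero-mean symmetric projection, ReLU gives $\rho=1/2$ regardless of scale, so in this toy setting a gap appears only through asymmetry.

```python
import numpy as np

def response_ratio(phi, proj):
    """Empirical rho(u) = E[phi(<u,h>)^2] / E[<u,h>^2] from samples of <u,h>."""
    return float(np.mean(phi(proj) ** 2) / np.mean(proj ** 2))

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
g = rng.standard_normal(200_000)

# Hypothetical projections <u, h> along three directions (Gaussian, unit variance).
rho_centered = response_ratio(relu, g)          # zero-mean direction
rho_positive = response_ratio(relu, g + 1.5)    # mostly positive projections
rho_negative = response_ratio(relu, g - 1.5)    # mostly negative projections
print(f"centered={rho_centered:.3f} positive={rho_positive:.3f} negative={rho_negative:.3f}")
```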

### B.1. Deterministic homogeneity barrier

We begin with the algebraic limitation: positive homogeneity prevents a per\-row FFN from creating more than two linearly independent output directions from a rank\-one input\. This is an exact, elementary argument\.

###### Theorem B\.1 \(Homogeneity Barrier — signed inputs\)\.

Let $X=\mathbf{c}\mathbf{v}^{\top}\in\mathbb{R}^{T\times D}$ (algebraic rank 1). If $\phi$ is positively homogeneous of degree 1, then for any $A,B$,

$$\operatorname{rank}\big(\mathcal{F}(X)\big)\leq 2,\qquad\operatorname{erank}\big(\mathcal{F}(X)\big)\leq 2.$$

If all entries of $\mathbf{c}$ share the same sign then $\operatorname{rank}(\mathcal{F}(X))=1$.

#### Intuition\.

When every row of $X$ is a scalar multiple of the same vector, each pre-activation is a scalar multiple of one fixed vector in $\mathbb{R}^{m}$. Positive homogeneity maps these scaled pre-activations into at most two activation vectors (one for positive scalars, one for negative), and the linear readout produces at most two output directions.

###### Proof\.

Write $X=\mathbf{c}\mathbf{v}^{\top}$. For each row $i$,

$$x_{i}=c_{i}\mathbf{v}^{\top},\qquad h_{i}=x_{i}A=c_{i}(\mathbf{v}^{\top}A)=c_{i}u,$$

with $u:=\mathbf{v}^{\top}A\in\mathbb{R}^{m}$. Positive homogeneity implies

$$\phi(c_{i}u)=|c_{i}|\,\phi(\operatorname{sign}(c_{i})u).$$

Let

$$w_{+}:=\phi(u)B,\qquad w_{-}:=\phi(-u)B\in\mathbb{R}^{D},$$

and write $c_{i}^{+}:=\max(c_{i},0)$, $c_{i}^{-}:=\max(-c_{i},0)$. Then the $i$-th output row equals

$$y_{i}=\phi(h_{i})B=c_{i}^{+}w_{+}+c_{i}^{-}w_{-}.$$

Stacking rows,

$$\mathcal{F}(X)=\mathbf{c}^{+}w_{+}^{\top}+\mathbf{c}^{-}w_{-}^{\top},$$

a sum of at most two rank-one outer products. Hence $\operatorname{rank}(\mathcal{F}(X))\leq 2$ and $\operatorname{erank}(\mathcal{F}(X))\leq 2$, the latter because the effective rank never exceeds the algebraic rank. If all entries of $\mathbf{c}$ share the same sign, one of $\mathbf{c}^{\pm}$ vanishes and the output reduces to rank 1. ∎
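The argument above can be reproduced numerically with ReLU as the positively homogeneous activation. The sketch below is a hedged illustration: the sizes, the Gaussian draws for $A$, $B$, $\mathbf{v}$, $\mathbf{c}$, and the helper `numerical_rank` are assumptions introduced here, not part of the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def numerical_rank(M, rtol=1e-9):
    """Count singular values above rtol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > rtol * s[0]))

rng = np.random.default_rng(0)
T, D, m = 32, 16, 64                               # hypothetical sizes
A = rng.standard_normal((D, m)) / np.sqrt(D)
B = rng.standard_normal((m, D)) / np.sqrt(D)
v = rng.standard_normal(D)
c_mixed = rng.standard_normal(T)                   # coefficients of both signs
c_pos = np.abs(c_mixed)                            # all coefficients share one sign

F_mixed = relu(np.outer(c_mixed, v) @ A) @ B       # F(X) for X = c v^T
F_pos = relu(np.outer(c_pos, v) @ A) @ B

# Explicit two-term decomposition from the proof: c^+ w_+ + c^- w_-.
u = v @ A
w_plus, w_minus = relu(u) @ B, relu(-u) @ B
F_decomp = np.outer(relu(c_mixed), w_plus) + np.outer(relu(-c_mixed), w_minus)

rank_mixed, rank_pos = numerical_rank(F_mixed), numerical_rank(F_pos)
print(f"rank with mixed signs: {rank_mixed}, rank with a single sign: {rank_pos}")
```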

### B.2. Probabilistic contraction for general effective rank

We now prove the contraction claim. The argument bounds the Frobenius energy and the top operator norm of the activated-and-readout matrix and combines these via the effective-rank identity. The standard concentration steps are stated explicitly within this subsection, organized according to the following outline.

- Show that with high probability all pre-activations lie in a bounded interval $[-R,R]$.
- Use Lipschitzness of $\phi$ on $[-R,R]$ and Hanson–Wright to concentrate $\|\phi(H)\|_{F}^{2}$.
- Use Matrix Bernstein to concentrate $\widehat{\Sigma}_{\phi}=\tfrac{1}{T}\phi(H)^{\top}\phi(H)$.
- Relate expectations to $\Sigma_{H}$ via the Response-Gap to bound $\mathbb{E}\|\phi(H)\|_{F}^{2}$ and $\lambda_{\max}(\mathbb{E}\widehat{\Sigma}_{\phi})$.
- Propagate bounds through the linear readout $B$ and combine to produce $\operatorname{erank}(\phi(H)B)\leq\alpha\,\operatorname{erank}(X)$.

###### Proof\.

Let $H=XA\in\mathbb{R}^{T\times m}$, $Z:=\phi(H)$, and $Y:=ZB=\phi(H)B$. Denote $\Sigma_{H}=\tfrac{1}{T}H^{\top}H$ with eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\cdots$, and $\widehat{\Sigma}_{Z}=\tfrac{1}{T}Z^{\top}Z$.

#### \(i\) Bounded pre\-activation window\.

Assume each input row $x$ satisfies $\|x\|_{2}^{2}\leq E_{\max}$. Since entries of $A$ are i.i.d. sub-Gaussian with variance $1/D$ and $\psi_{2}$-norm $\leq K/\sqrt{D}$, each coordinate $h_{t,j}$ is sub-Gaussian with

$$\|h_{t,j}\|_{\psi_{2}}\leq CK\sqrt{E_{\max}/D}.$$

By a union bound over the $Tm$ coordinates, for any $\delta\in(0,1)$, with probability at least $1-\delta$ all pre-activations lie in

$$[-R,R],\qquad R=C^{\prime}K\sqrt{\frac{E_{\max}\log(mT/\delta)}{D}},$$

for absolute constants $C,C^{\prime}$. Fix $\delta$ polynomially small and condition on this high-probability event.
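For intuition about the size of this window, the sketch below instantiates step (i) in the special case of Gaussian $A$, where the constants can be written out: each $h_{t,j}$ is then $\mathcal{N}(0,\|x_t\|_2^2/D)$, and the Gaussian tail bound with a union bound gives $R=\sqrt{2E_{\max}\log(2mT/\delta)/D}$. The Gaussian choice, the explicit constants, and the sizes are assumptions for illustration; the general sub-Gaussian statement absorbs them into $C^{\prime}K$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, m, delta = 64, 32, 128, 1e-6             # hypothetical sizes and failure level

X = rng.standard_normal((T, D))
E_max = float(np.max(np.sum(X ** 2, axis=1)))  # per-row energy bound
A = rng.standard_normal((D, m)) / np.sqrt(D)   # Gaussian entries with variance 1/D
H = X @ A                                      # h_{t,j} ~ N(0, ||x_t||^2 / D) given X

# Gaussian tail: P(|h| > R) <= 2 exp(-R^2 D / (2 E_max)), union-bounded over T*m entries.
R = float(np.sqrt(2.0 * E_max * np.log(2 * m * T / delta) / D))
max_abs = float(np.abs(H).max())
print(f"max |h| = {max_abs:.3f}, window R = {R:.3f}")
```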

#### \(ii\) Frobenius concentration \(Hanson–Wright\)\.

Because $\phi$ is $L$-Lipschitz on $[-R,R]$, each $\phi(h_{t,j})$ is sub-Gaussian with parameter $\tilde{K}=O(KL\sqrt{E_{\max}/D})$. Applying Hanson–Wright (or Bernstein-style bounds) to the quadratic sum gives: for any $t\geq 0$,

$$\Pr\!\big(\big|\|Z\|_{F}^{2}-\mathbb{E}\|Z\|_{F}^{2}\big|\geq t\big)\leq 2\exp\!\Big(-c\min\Big\{\frac{t^{2}}{K^{4}L^{4}Tm},\;\frac{t}{K^{2}L^{2}}\Big\}\Big),$$

for an absolute constant $c>0$. Thus $\|Z\|_{F}^{2}$ concentrates sharply around $\mathbb{E}\|Z\|_{F}^{2}$.

#### \(iii\) Spectral concentration \(Matrix Bernstein\)\.

Represent $\widehat{\Sigma}_{Z}=\tfrac{1}{T}\sum_{t=1}^{T}z_{t}z_{t}^{\top}$ with rows $z_{t}$. Each summand has operator norm and variance proxy controlled by $K^{2}L^{2}E_{\max}/D$. Matrix Bernstein yields: for any $\tau>0$,

$$\Pr\big(\|\widehat{\Sigma}_{Z}-\mathbb{E}\widehat{\Sigma}_{Z}\|_{\mathrm{op}}\geq\tau\big)\leq 2m\exp\!\Big(-c\frac{T\tau^{2}}{(K^{2}L^{2}E_{\max}/D)^{2}r^{2}}\Big).$$

Taking

$$\tau=\Theta\!\Big(\frac{K^{2}L^{2}E_{\max}}{D}\sqrt{\frac{r^{2}\log m}{T}}\Big)$$

gives $\|\widehat{\Sigma}_{Z}-\mathbb{E}\widehat{\Sigma}_{Z}\|_{\mathrm{op}}\leq\tau$ with failure probability $O(\exp(-\Theta(r)))$ under the stated scaling.

#### \(iv\) Expectations and the Response\-Gap\.

Let $\{u_{j}\}_{j=1}^{m}$ be an orthonormal eigenbasis of $\Sigma_{H}$. Then

$$\mathbb{E}\|Z\|_{F}^{2}=T\sum_{j=1}^{m}\rho(u_{j})\,\lambda_{j}.$$

Splitting into top ($j\leq r$) and tail ($j>r$) components and applying the Response-Gap bounds yields

$$\begin{aligned}\mathbb{E}\|Z\|_{F}^{2}&\leq T\big(c_{\mathrm{top}}\sum_{j=1}^{r}\lambda_{j}+c_{\mathrm{low}}\sum_{j=r+1}^{m}\lambda_{j}\big)\\&=T\big(c_{\mathrm{top}}\eta+c_{\mathrm{low}}(1-\eta)\big)\operatorname{tr}(\Sigma_{H})\\&=:TM_{F}\operatorname{tr}(\Sigma_{H}).\end{aligned}\tag{5}$$

Moreover,

$$\lambda_{\max}(\mathbb{E}\widehat{\Sigma}_{Z})\geq c_{\mathrm{top}}\lambda_{1}(\Sigma_{H}).$$

#### (v) Effect of linear readout $B$.

Let $Y=ZB$. Since $B$ has i.i.d. entries with variance $1/D$, standard moment computations show that application of $B$ scales Frobenius and operator norms by multiplicative factors $\kappa_{F},\kappa_{\mathrm{op}}>0$ (which are $1\pm o(1)$ under the random matrix scaling regime). After applying $B$, the concentration and expectation bounds from (ii)–(iv) transfer to $\|Y\|_{F}^{2}$ and $\|Y\|_{2}^{2}$ up to these multiplicative corrections and small additive fluctuations.

#### \(vi\) Combine to bound effective rank\.

Using $\operatorname{erank}(Y)=\|Y\|_{F}^{2}/\|Y\|_{2}^{2}$, the preceding bounds imply that with probability at least $1-O(\exp(-\Theta(r)))$,

$$\|Y\|_{F}^{2}\leq\kappa_{F}\cdot TM_{F}\operatorname{tr}(\Sigma_{H})+\delta_{F},\qquad\|Y\|_{2}^{2}\geq\kappa_{\mathrm{op}}\cdot\big(c_{\mathrm{top}}\lambda_{1}(\Sigma_{H})-\tau\big)-\delta_{\mathrm{op}},$$

where $\tau,\delta_{F},\delta_{\mathrm{op}}$ are small concentration errors from (ii), (iii) and the randomness of $B$. Hence

$$\operatorname{erank}(Y)\leq\frac{\kappa_{F}M_{F}+o(1)}{\kappa_{\mathrm{op}}(c_{\mathrm{top}}-o(1))}\cdot\frac{\operatorname{tr}(\Sigma_{H})}{\lambda_{1}(\Sigma_{H})}.$$

Defining

$$\alpha:=\frac{\kappa_{F}M_{F}+o(1)}{\kappa_{\mathrm{op}}(c_{\mathrm{top}}-o(1))},$$

and noting $M_{F}=c_{\mathrm{top}}\eta+c_{\mathrm{low}}(1-\eta)<c_{\mathrm{top}}$ because $c_{\mathrm{low}}<c_{\mathrm{top}}$ and $\eta\in(0,1]$ (the inequality is strict whenever $\eta<1$; at $\eta=1$ it holds with equality), we conclude that $\alpha\in(0,1)$ for sufficiently small concentration errors. Therefore, with probability at least $1-O(\exp(-\Theta(r)))$,

$$\operatorname{erank}(Y)\leq\alpha\,\operatorname{erank}(X),$$

which establishes the probabilistic contraction claim. ∎

## Appendix C. Justification of Theorem [3.1](https://arxiv.org/html/2605.23191#S3.Thmtheorem1)

In this appendix we justify Theorem [3.1](https://arxiv.org/html/2605.23191#S3.Thmtheorem1). We show that the block granularity parameter $d^{\ast}$ controls a trade-off between computational efficiency and representational power: the fine-grained case $d^{\ast}=1$ is strictly more expressive than any coarse-grained case $d^{\ast}>1$, and this expressivity gap is particularly salient when the input has low effective rank.

#### Restated theorem\.

Let $X\in\mathbb{R}^{T\times D}$ and $\mathbf{x}=\operatorname{vec}(X)\in\mathbb{R}^{N}$ with $N=TD$. Fix a divisor $d^{\ast}$ of $N$ and write $K=N/d^{\ast}$. Partition $\mathbf{x}$ into $K$ contiguous blocks $\mathcal{B}=\{\mathbf{b}_{1},\dots,\mathbf{b}_{K}\}$ with $\mathbf{b}_{k}\in\mathbb{R}^{d^{\ast}}$. A parameterized block-mixing layer with weight matrix $\mathbf{W}\in\mathbb{R}^{K\times K}$ produces

$$\mathbf{y}=(\mathbf{W}\otimes\mathbf{I}_{d^{\ast}})\,\mathbf{x},$$

and the residual output is

$$\mathbf{C}=\Phi_{d^{\ast}}(X;\mathbf{W})\triangleq X+\operatorname{reshape}\bigl((\mathbf{W}\otimes\mathbf{I}_{d^{\ast}})\operatorname{vec}(X)\bigr).$$

When $d^{\ast}=1$ this reduces to a full learned linear mixing on coordinates.
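The definition of $\Phi_{d^{\ast}}$ translates directly into a few lines of NumPy. The sketch below is an illustration under stated assumptions rather than the released implementation: it takes $\operatorname{vec}$ to be row-major flattening, and the function names `block_mix` and `block_mix_fast` are hypothetical. The second function avoids materializing the $N\times N$ Kronecker product by applying $\mathbf{W}$ to the $K\times d^{\ast}$ matrix of stacked blocks.

```python
import numpy as np

def block_mix(X, W, d_star):
    """Phi_{d*}(X; W) = X + reshape((W kron I_{d*}) vec(X)), with row-major vec."""
    x = X.reshape(-1)
    y = np.kron(W, np.eye(d_star)) @ x
    return X + y.reshape(X.shape)

def block_mix_fast(X, W, d_star):
    """Same map without forming the Kronecker product: block k becomes sum_j W[k, j] b_j."""
    blocks = X.reshape(-1, d_star)                 # K x d*, one block per row
    return X + (W @ blocks).reshape(X.shape)

rng = np.random.default_rng(0)
T, D, d_star = 4, 6, 3                             # hypothetical sizes, d* divides N = T*D
K = (T * D) // d_star
X = rng.standard_normal((T, D))
W = rng.standard_normal((K, K))

C_kron = block_mix(X, W, d_star)
C_fast = block_mix_fast(X, W, d_star)
print(np.abs(C_kron - C_fast).max())
```

The block form costs $O(K^{2}d^{\ast})$ per input instead of $O(N^{2})$, which is the efficiency side of the trade-off controlled by $d^{\ast}$.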

### C.1. Expressivity gap theorem

We formalize the expressivity limitation of coarse blocking\.

###### Theorem C\.1 \(Fine\-grained expressivity and subspace constraint\)\.

Define the reachable set

$$\mathcal{R}_{d^{\ast}}(X)=\bigl\{\,C\in\mathbb{R}^{T\times D}\mid\exists\,\mathbf{W},\ C=\Phi_{d^{\ast}}(X;\mathbf{W})\,\bigr\}.$$

Assume $X\neq 0$. Then:

- (i) (Universal reachability at $d^{\ast}=1$) $\mathcal{R}_{1}(X)=\mathbb{R}^{T\times D}$.
- (ii) (Strict deficiency at $d^{\ast}>1$) For every $d^{\ast}>1$, $\mathcal{R}_{d^{\ast}}(X)\subsetneq\mathcal{R}_{1}(X)$.
- (iii) (Dependence on effective rank) Let $V_{\mathrm{blocks}}=\operatorname{span}\{\mathbf{b}_{1},\dots,\mathbf{b}_{K}\}\subset\mathbb{R}^{d^{\ast}}$. All perturbations produced by $\Phi_{d^{\ast}}$ have each block lying in $V_{\mathrm{blocks}}$. If $\operatorname{erank}(X)\ll\min\{T,D\}$, then the $\mathbf{b}_{k}$ are approximately confined to a low-dimensional subspace of $\mathbb{R}^{d^{\ast}}$, so there exist fine-grained variations (directions orthogonal to $V_{\mathrm{blocks}}$) that $\Phi_{d^{\ast}}$ cannot express.

###### Proof\.

We structure the proof into two parts: (i) the case $d^{\ast}=1$, and (ii)–(iii) the case $d^{\ast}>1$.

#### Part (i).

For $d^{\ast}=1$, each block is a scalar and $K=N$. The update term becomes $\mathbf{W}\mathbf{x}$ and $\operatorname{vec}(C)=\mathbf{x}+\mathbf{W}\mathbf{x}=(\mathbf{I}+\mathbf{W})\mathbf{x}$. Since $\mathbf{x}\neq 0$, for any target vector $\mathbf{z}\in\mathbb{R}^{N}$ we can choose $\mathbf{W}$ (e.g. via the outer-product construction $\mathbf{W}=(\mathbf{z}-\mathbf{x})\mathbf{x}^{\top}/(\mathbf{x}^{\top}\mathbf{x})$) so that $(\mathbf{I}+\mathbf{W})\mathbf{x}=\mathbf{z}$. Hence $\mathcal{R}_{1}(X)=\mathbb{R}^{T\times D}$.
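The outer-product construction in Part (i) is easy to verify directly. In the sketch below the sizes and the random input and target are arbitrary illustrative choices; only the formula for $\mathbf{W}$ comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 3, 4                                    # hypothetical sizes
X = rng.standard_normal((T, D))                # nonzero input
target = rng.standard_normal((T, D))           # arbitrary target matrix

x, z = X.reshape(-1), target.reshape(-1)
W = np.outer(z - x, x) / (x @ x)               # W = (z - x) x^T / (x^T x)
C = (x + W @ x).reshape(T, D)                  # vec(C) = (I + W) x

print(np.abs(C - target).max())
```

Note that this $\mathbf{W}$ has rank one and depends on the specific input $\mathbf{x}$; the construction shows reachability for a fixed $X$, not a single $\mathbf{W}$ that works for all inputs.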

#### Parts (ii) and (iii).

For general $d^{\ast}>1$, write the perturbation $\mathbf{p}=\operatorname{vec}(C-X)=(\mathbf{W}\otimes\mathbf{I}_{d^{\ast}})\mathbf{x}$ and partition $\mathbf{p}$ into blocks $\mathbf{p}_{1},\dots,\mathbf{p}_{K}$. By the block-Kronecker structure, the $k$-th block satisfies

$$\mathbf{p}_{k}=\sum_{j=1}^{K}W_{kj}\,\mathbf{b}_{j},$$

hence $\mathbf{p}_{k}\in V_{\mathrm{blocks}}$ for every $k$. Thus every reachable perturbation is blockwise constrained to the subspace $V_{\mathrm{blocks}}\subseteq\mathbb{R}^{d^{\ast}}$; equivalently, $\mathcal{R}_{d^{\ast}}(X)$ lies inside an affine subspace of $\mathbb{R}^{T\times D}$ of strictly smaller dimension whenever $\dim(V_{\mathrm{blocks}})<d^{\ast}$. Therefore, whenever $V_{\mathrm{blocks}}\neq\mathbb{R}^{d^{\ast}}$ there exist target matrices whose required block perturbation contains a component orthogonal to $V_{\mathrm{blocks}}$; such targets are not in $\mathcal{R}_{d^{\ast}}(X)$, so $\mathcal{R}_{d^{\ast}}(X)\subsetneq\mathcal{R}_{1}(X)$.

To see why low effective rank makes the deficiency generic, note that $\operatorname{erank}(X)=\|X\|_{F}^{2}/\|X\|_{2}^{2}\ll\min\{T,D\}$ implies that most of the energy of $X$ is concentrated in a few global singular directions. Under standard vectorization (row- or column-major), contiguous local blocks $\mathbf{b}_{k}$ therefore inherit strong correlations and typically span a low-dimensional subspace of $\mathbb{R}^{d^{\ast}}$. Concretely, $\dim(V_{\mathrm{blocks}})\ll d^{\ast}$ in this regime, so many fine-grained directions orthogonal to $V_{\mathrm{blocks}}$ exist and cannot be generated by the Kronecker-constrained operator. This establishes the dependence on effective rank and completes the proof. ∎
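The blockwise subspace constraint can be observed on a small example. The sketch below uses a rank-one input as an extreme low-rank case; the sizes, the row-major vectorization, and the helper `numerical_rank` are assumptions made for illustration. With $D=8$ and $d^{\ast}=4$, every block is a multiple of one of two segments of $\mathbf{v}$, so $\dim(V_{\mathrm{blocks}})\leq 2<d^{\ast}$, and a random $\mathbf{W}$ cannot move any block outside that span.

```python
import numpy as np

def numerical_rank(M, rtol=1e-9):
    """Count singular values above rtol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > rtol * s[0]))

rng = np.random.default_rng(0)
T, D, d_star = 6, 8, 4                             # hypothetical sizes
X = np.outer(rng.standard_normal(T), rng.standard_normal(D))   # rank-one input

blocks = X.reshape(-1, d_star)                     # K x d*, row-major vec (assumption)
K = blocks.shape[0]
W = rng.standard_normal((K, K))                    # arbitrary mixing weights
P = (np.kron(W, np.eye(d_star)) @ X.reshape(-1)).reshape(K, d_star)  # perturbation blocks p_k

dim_V = numerical_rank(blocks)                     # dim(V_blocks)
dim_joint = numerical_rank(np.vstack([blocks, P])) # unchanged if every p_k lies in V_blocks

# A unit direction orthogonal to V_blocks (exists because dim_V < d*).
_, _, Vt = np.linalg.svd(blocks)
e_perp = Vt[-1]
leak = float(np.abs(P @ e_perp).max())             # component of any p_k along e_perp
print(f"dim(V_blocks) = {dim_V}, joint dim = {dim_joint}, leak = {leak:.2e}")
```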

#### Remark\.

The argument shows that the expressivity loss for $d^{\ast}>1$ is structural: the Kronecker factor $\mathbf{W}\otimes\mathbf{I}_{d^{\ast}}$ enforces that all block-wise updates lie in the same block-span. If desired, one can make the statements quantitative by lower-bounding the distance between a chosen fine-grained target and the set $\mathcal{R}_{d^{\ast}}(X)$ in terms of the singular-value decay of $X$ and the principal angles between block samples; we omit these technical refinements for brevity.

## Appendix D. Justification of Theorem [3.2](https://arxiv.org/html/2605.23191#S3.Thmtheorem2)

This section provides a complete, self-contained justification of Theorem [3.2](https://arxiv.org/html/2605.23191#S3.Thmtheorem2). We first restate the claims for convenience, then prove (i) the *algebraic lifting* claim that the GLU-based multiplicative term generates degree-2 interactions of the latent coordinates, and (ii) a quantitative, high-probability *effective rank increase*, obtained by combining Frobenius/operator concentration with a Jacobian / trace lower bound that certifies the numerical usefulness of the new directions. Throughout we use the notation $\odot$ for the Hadamard product, and assume per-row energy $\|x\|_{2}^{2}\leq E_{\max}$.

#### Restated theorem.

Let $X \in \mathbb{R}^{T \times D}$ satisfy $\operatorname{erank}(X) = k$ with $k/D \leq \gamma \in (0,1)$. Consider

$$\mathcal{G}(X) = \big(\phi(XA) \odot (XC)\big) B + XD,$$

with $A, C \in \mathbb{R}^{D \times m}$ and $B \in \mathbb{R}^{m \times D}$ having i.i.d. sub-Gaussian entries of variance $1/D$ and $\psi_2$-norm $\leq K/\sqrt{D}$. If the hidden width satisfies $m \geq C_0 k \log D$ for a universal constant $C_0$, then with probability at least $1 - \exp(-ck)$:

- (Algebraic lifting) $\operatorname{rank}\!\big(\phi(XA) \odot (XC)\big) \geq \min\!\big(D, \tfrac{k(k+1)}{2}\big)$.
- (Effective rank increase) There exists $\delta > 0$ (depending on $\gamma, K, E_{\max}$) such that $\operatorname{erank}\big(\mathcal{G}(X)\big) \geq \operatorname{erank}(X) + \delta$.
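To make the objects in the theorem concrete, the following NumPy sketch evaluates the block $\mathcal{G}(X)$ and two candidate rank functionals. This appendix does not restate the definition of $\operatorname{erank}$, so both functionals are assumptions: `erank_entropy` is the entropy-based effective rank (exponential of the Shannon entropy of the normalized singular values), and `stable_rank` is the ratio $\|X\|_F^2 / \sigma_{\max}^2(X)$. The function names, the choice $\phi = \tanh$, and the name `D_res` for the residual matrix (written $D$ in the theorem, where it shares a symbol with the feature dimension) are likewise illustrative.

```python
import numpy as np

def erank_entropy(X, eps=1e-12):
    """Entropy-based effective rank (an assumed definition, not confirmed by the text)."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)          # normalized singular values
    p = p[p > eps]                   # drop numerically zero entries
    return float(np.exp(-(p * np.log(p)).sum()))

def stable_rank(X):
    """Frobenius-to-spectral ratio ||X||_F^2 / sigma_max^2 (alternative assumption)."""
    s = np.linalg.svd(X, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

def glu_block(X, A, C, B, D_res, phi=np.tanh):
    """G(X) = (phi(XA) * (XC)) B + X D_res, with * the Hadamard product."""
    return (phi(X @ A) * (X @ C)) @ B + X @ D_res

def sample_weights(D, m, rng):
    """Gaussian weights with entry variance 1/D (one sub-Gaussian choice)."""
    scale = 1.0 / np.sqrt(D)
    A = rng.normal(0.0, scale, size=(D, m))
    C = rng.normal(0.0, scale, size=(D, m))
    B = rng.normal(0.0, scale, size=(m, D))
    return A, C, B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, k, m = 256, 32, 4, 128
    X = rng.normal(size=(T, k)) @ rng.normal(size=(k, D))  # exact rank k
    A, C, B = sample_weights(D, m, rng)
    Y = glu_block(X, A, C, B, np.eye(D))
    print("input :", erank_entropy(X), stable_rank(X))
    print("output:", erank_entropy(Y), stable_rank(Y))
```

The printed values depend on the seed and on the input scale, so they illustrate the quantities being compared rather than verify the theorem.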

The remainder of this section proves these two claims in order, and then provides the Jacobian/trace analysis that explains why the added algebraic directions are numerically significant (non-vanishing singular values).

### D.1. Algebraic lifting: polynomial degree-2 span

We begin by showing that the multiplicative GLU-style term realizes degree-2 monomials in the $k$-dimensional latent coefficients of $X$; random projection then turns these monomials into new independent algebraic directions with high probability.

#### Latent factorization.

Since $\operatorname{erank}(X) = k$, there is a factorization

$$X = S V^{\top}, \qquad S \in \mathbb{R}^{T \times k}, \quad V \in \mathbb{R}^{D \times k}, \quad V^{\top} V = I_k,$$

so each row satisfies $x_t = V s_t$ for some $s_t \in \mathbb{R}^k$. Define

$$U := V^{\top} A \in \mathbb{R}^{k \times m}, \qquad W := V^{\top} C \in \mathbb{R}^{k \times m},$$

and let $u_j, w_j \in \mathbb{R}^k$ denote column $j$ of $U$ and $W$, respectively. Then the multiplicative pre-features have entries

$$z_{t,j} \;=\; \phi(\langle u_j, s_t \rangle)\, \langle w_j, s_t \rangle.$$

#### Degree-2 component.

Expand $\phi$ around zero (valid in a small-variance pre-activation regime; the same argument applies more generally by single-variable Taylor or polynomial approximation on a bounded window):

$$\phi(z) = a_1 z + a_2 z^2 + O(z^3).$$

Hence the leading degree-2 component of $z_{t,j}$ is

$$q_{t,j} := a_1 \cdot \langle u_j, s_t \rangle \langle w_j, s_t \rangle.$$

Write the vector of symmetric quadratic monomials as

$$\zeta_t := \operatorname{vecs}(s_t s_t^{\top}) \in \mathbb{R}^q, \qquad q = \frac{k(k+1)}{2},$$

so that there is a deterministic linear functional $r_j \in \mathbb{R}^q$ (the symmetric vectorization of the matrix $u_j w_j^{\top}$) with

$$q_{t,j} = a_1 \langle r_j, \zeta_t \rangle.$$
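The identity $\langle u_j, s_t\rangle\langle w_j, s_t\rangle = \langle r_j, \zeta_t\rangle$ can be checked numerically. The sketch below assumes one particular $\operatorname{vecs}$ convention (upper-triangular entries in row-major order, with no $\sqrt{2}$ rescaling); the text does not fix a convention, so the helper names and the ordering are illustrative choices.

```python
import numpy as np

def vecs_monomials(s):
    """zeta = (s_a * s_b for a <= b), length k(k+1)/2 (assumed ordering)."""
    k = len(s)
    return np.outer(s, s)[np.triu_indices(k)]

def vecs_functional(u, w):
    """r such that <r, zeta> = <u, s><w, s> under the ordering above."""
    k = len(u)
    M = np.outer(u, w)
    M_sym = M + M.T                             # off-diagonal: u_a w_b + u_b w_a
    M_sym[np.diag_indices(k)] = np.diag(M)      # diagonal: u_a w_a (not doubled)
    return M_sym[np.triu_indices(k)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 4
    s, u, w = rng.normal(size=k), rng.normal(size=k), rng.normal(size=k)
    lhs = (u @ s) * (w @ s)
    rhs = vecs_functional(u, w) @ vecs_monomials(s)
    print(abs(lhs - rhs))   # difference is at floating-point round-off level
```

Off-diagonal coefficients are symmetrized because each monomial $s_a s_b$ with $a < b$ appears twice in the double sum $\sum_{a,b} u_a w_b s_a s_b$.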

#### Random linear map onto hidden units.

Stack the $\zeta_t$ into $\mathcal{Z} \in \mathbb{R}^{T \times q}$ (rows $\zeta_t^{\top}$). Up to the scalar factor $a_1$, the degree-2 part across all hidden units is

$$Z_{\mathrm{quad}} \;=\; \mathcal{Z}\, R^{\top}, \qquad R := [r_1, \dots, r_m]^{\top} \in \mathbb{R}^{m \times q}.$$

Because $A, C$ are random sub-Gaussian and $V$ is fixed orthonormal, the rows $r_j$ of $R$ are i.i.d. sub-Gaussian vectors in $\mathbb{R}^q$ with variance scale $1/D$ (up to $K$ factors). Standard non-asymptotic results (see Vershynin) imply that for any fixed $q$-dimensional subspace the random matrix $R$ is injective with high probability provided $m \gtrsim q$ (with a mild $\log D$ factor to carry the subsequent readout). Concretely, if

$$m \geq C_0\, q \log D = C_0\, \tfrac{k(k+1)}{2} \log D,$$

then with probability at least $1 - \exp(-cq)$ we have $\operatorname{rank}(R) = \min\{m, q\}$.

#### Passage through readout $B$.

The multiplicative term before readout has row space equal to the column span of $\mathcal{Z}$ after applying $R^{\top}$. Applying the linear readout $B \in \mathbb{R}^{m \times D}$ (random i.i.d. entries of variance $1/D$) maps $\mathbb{R}^m$ into $\mathbb{R}^D$. With high probability $B$ has rank $\min\{m, D\}$. Combining these rank lower bounds gives

$$\operatorname{rank}\!\big((\phi(XA) \odot (XC)) B\big) \geq \min\{D,\ q\} = \min\!\left(D, \frac{k(k+1)}{2}\right)$$

with probability at least $1 - \exp(-c'k)$. This proves the algebraic lifting claim.
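The lifting claim can be illustrated on a small synthetic instance. The sketch below is a sanity check under stated assumptions, not the paper's experiment: it uses Gaussian weights, replaces $\phi$ by its linear part (so $a_1 = 1$ and the multiplicative term is exactly the degree-2 component), and counts singular values above a relative threshold. The sizes, the threshold, and the function names are hypothetical.

```python
import numpy as np

def numerical_rank(M, rtol=1e-8):
    """Number of singular values above rtol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > rtol * s[0]).sum())

def lifting_ranks(T=200, D=32, k=4, m=64, seed=0):
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(T, k))                      # latent coefficients s_t
    V, _ = np.linalg.qr(rng.normal(size=(D, k)))     # D x k, orthonormal columns
    X = S @ V.T                                      # exact rank k
    scale = 1.0 / np.sqrt(D)
    A = rng.normal(0.0, scale, size=(D, m))
    C = rng.normal(0.0, scale, size=(D, m))
    B = rng.normal(0.0, scale, size=(m, D))
    Z_quad = (X @ A) * (X @ C)                       # linearized phi: pure degree-2 term
    return numerical_rank(X), numerical_rank(Z_quad), numerical_rank(Z_quad @ B)

if __name__ == "__main__":
    print(lifting_ranks())
```

With $k = 4$ the quadratic feature space has dimension $q = k(k+1)/2 = 10 \leq D$, so for generic draws the three ranks are $k$, $q$, and $\min(D, q)$: the Hadamard product lifts a rank-4 input to rank 10, and the readout preserves that rank.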

### D.2. Quantitative effective-rank increase and Jacobian spectrum

Having established algebraic lifting, we now show that the multiplicative term injects nontrivial Frobenius energy orthogonal to the original linear path, and that the Jacobian $J_g$ of the GLU-improved P-FFN has a large trace $\operatorname{tr}(J_g^{\top} J_g)$ (the sum of squared singular values) under the width condition. Together with the residual $XD$, these observations imply a numerical (effective-rank) increase.

#### Notation and decomposition.

Write the multiplicative features as $Z \in \mathbb{R}^{T \times m}$ with entries $z_{t,j} = \phi(\langle u_j, s_t \rangle)\langle w_j, s_t \rangle$. After readout $B$ and adding the residual $XD$, the network output is

$$Y = \mathcal{G}(X) = ZB + XD.$$

Let $P_{\parallel}$ denote the orthogonal projector onto the row-space of $X$ (dimension $\leq k$) and $P_{\perp} = I - P_{\parallel}$. We aim to lower bound $\mathbb{E}\|P_{\perp} Y\|_F^2$ and show it is $\Omega(m^{-1} k \log D)$ under the stated scaling; this suffices to increase effective rank by a positive $\delta$.

#### Orthogonal energy is preserved in expectation through random readout.

Because $B$ has i.i.d. entries with variance $1/D$ and is independent of $Z$,

$$\begin{aligned}
\mathbb{E}_B \|P_{\perp}(ZB)\|_F^2 &= \mathbb{E}_B \operatorname{tr}\big(B^{\top} Z^{\top} P_{\perp} Z B\big) \\
&= \operatorname{tr}\big(\mathbb{E}_B[BB^{\top}]\ \mathbb{E}[Z^{\top} P_{\perp} Z]\big) \\
&= \mathbb{E}\|P_{\perp} Z\|_F^2.
\end{aligned} \tag{7}$$

The last step uses $\mathbb{E}_B[BB^{\top}] = I_m$, which holds for centered entries because each row of $B$ has $D$ independent entries of variance $1/D$. Thus it suffices to lower bound $\mathbb{E}\|P_{\perp} Z\|_F^2$.
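Equation (7) says that a random readout with entry variance $1/D$ preserves Frobenius energy in expectation. A minimal Monte Carlo sketch of that statement follows; it assumes centered Gaussian entries for $B$ and uses a fixed matrix `M` as a stand-in for $P_{\perp} Z$. The sizes, the number of draws, and the function name are hypothetical.

```python
import numpy as np

def readout_energy_ratio(T=20, m=12, D=16, n_draws=4000, seed=0):
    """Monte Carlo estimate of E_B ||M B||_F^2 / ||M||_F^2 for a fixed M."""
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(T, m))                    # stands in for P_perp Z
    target = float(np.sum(M ** 2))                 # ||M||_F^2
    # n_draws independent readouts B of shape (m, D), entry variance 1/D
    B = rng.normal(0.0, 1.0 / np.sqrt(D), size=(n_draws, m, D))
    energies = np.sum((M @ B) ** 2, axis=(1, 2))   # ||M B||_F^2 per draw
    return float(energies.mean() / target)

if __name__ == "__main__":
    print(readout_energy_ratio())   # close to 1, up to Monte Carlo error
```

The ratio is close to one only on average over $B$; a single draw can deviate noticeably when $D$ is small, which is why the argument later appeals to concentration.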

#### Columnwise contributions originate from quadratic features.

Decompose $Z = [z^{(1)} | \dots | z^{(m)}]$ with columns $z^{(j)} \in \mathbb{R}^T$ (entries $z^{(j)}_t$). Using the degree-2 decomposition $z^{(j)} = q^{(j)} + r^{(j)}$, where $q^{(j)}_t = a_1 \langle u_j, s_t \rangle \langle w_j, s_t \rangle$ is the degree-2 component and $r^{(j)}$ collects higher-order and smaller terms, we have

$$\mathbb{E}\|P_{\perp} Z\|_F^2 \geq \sum_{j=1}^{m} \mathbb{E}\|P_{\perp} q^{(j)}\|_2^2 - \sum_{j=1}^{m} \mathbb{E}\|r^{(j)}\|_2^2.$$

Under the small-variance regime or by truncation (pre-activations restricted to a bounded window), the remainder $\sum_j \mathbb{E}\|r^{(j)}\|_2^2$ is lower order; we therefore focus on the dominant $q^{(j)}$ terms.

As in the lifting argument, $q^{(j)} = \mathcal{Z} Q_j$, where $\mathcal{Z} \in \mathbb{R}^{T \times q}$ collects the symmetric quadratic monomials $\zeta_t$ and $Q_j \in \mathbb{R}^q$ is the symmetric vectorization of $u_j w_j^{\top}$. Thus

$$\mathbb{E}\|P_{\perp} q^{(j)}\|_2^2 = \mathbb{E}_{Q_j}\, Q_j^{\top} \big(\mathcal{Z}^{\top} P_{\perp} \mathcal{Z}\big) Q_j.$$

Averaging over random $Q_j$ (sub-Gaussian rows) gives

$$\mathbb{E}_{Q_j} \mathbb{E}_S \|P_{\perp} q^{(j)}\|_2^2 = \operatorname{tr}\big(\mathcal{Z}^{\top} P_{\perp} \mathcal{Z} \cdot \mathbb{E}[Q_j Q_j^{\top}]\big).$$

Because $\mathbb{E}[Q_j Q_j^{\top}]$ is proportional to the identity on the $q$-dimensional monomial space (up to constants depending on $K, E_{\max}$), we obtain

$$\mathbb{E}\|P_{\perp} q^{(j)}\|_2^2 \;\geq\; c_0 \cdot \operatorname{tr}\big(\mathcal{Z}^{\top} P_{\perp} \mathcal{Z}\big) = c_0 \sum_{t=1}^{T} \|P_{\perp} \zeta_t\|_2^2,$$

for some $c_0 > 0$. Summing over $j = 1, \dots, m$ yields

$$\mathbb{E}\|P_{\perp} Z\|_F^2 \;\geq\; m c_0 \sum_{t=1}^{T} \|P_{\perp} \zeta_t\|_2^2 - \text{(small remainders)}.$$

#### A nontrivial fraction of quadratic energy lies outside the linear span.

The projector $P_{\parallel}$ only removes components that lie in the linear span of the rows of $S$ (dimension $\leq k$). The vectors $\zeta_t$ live in a $q = \tfrac{k(k+1)}{2}$-dimensional quadratic feature space. Unless the data $\{s_t\}$ are algebraically degenerate (a measure-zero event for typical data, or one avoided by a small perturbation), a constant fraction of the quadratic energy lies outside the linear span. Formally, let $\mathcal{P}_{\mathrm{quad},\parallel}$ denote the projection of the quadratic feature space onto the subspace spanned by linear functions of $s$; this subspace has dimension at most $k$. Therefore, under the random-design assumptions and with $m \gtrsim k \log D$,

$$\sum_{t=1}^{T} \|P_{\perp} \zeta_t\|_2^2 \geq c_1\, T \cdot \frac{k \log D}{m},$$

for a constant $c_1 > 0$. Combining with the previous display yields

$$\mathbb{E}\|P_{\perp} Z\|_F^2 \;\geq\; c'\, m\, T \cdot \frac{k \log D}{m} = c' T k \log D,$$

hence (after normalizing by $T$) the per-row average orthogonal energy is at least $c' k \log D$. Dividing by the readout scaling (variance $1/D$) and tracking constants as above gives

$$\mathbb{E}\|P_{\perp} Y_{\mathrm{mult}}\|_F^2 \;\geq\; c''\, m^{-1} k \log D,$$

for some $c'' > 0$ depending on $K, E_{\max}$, where $Y_{\mathrm{mult}} := ZB$ denotes the multiplicative part of the output.

#### A Jacobian trace lower bound certifies numerical usefulness.

To argue that the new directions are not only algebraic but numerically useful (i.e., correspond to non-vanishing singular values), we examine the per-row Jacobian of the GLU-improved P-FFN block and lower bound the trace $\operatorname{tr}(J_g(x)^{\top} J_g(x))$ (the sum of squared singular values).

For a single row $x$, the Jacobian of the GLU-improved P-FFN block (output w.r.t. input row) is

$$\begin{aligned}
J_g(x) &= D^{\top} + B^{\top} \operatorname{diag}(C^{\top} x) \operatorname{diag}(\phi'(A^{\top} x)) A^{\top} \\
&\qquad + B^{\top} \operatorname{diag}(\phi(A^{\top} x)) C^{\top}.
\end{aligned} \tag{8}$$

The multiplicative term contributes the last two summands. Taking expectation over $A, C, B$ (and over the input $x$ when needed) and using that the entries of $A, C, B$ are independent sub-Gaussian with variance $1/D$, non-asymptotic singular-value bounds imply a lower bound on the expected trace of order $c_2 k \log D$, up to subtracting the purely additive FFN upper-bound term $C_K (c_{\mathrm{top}} \mathcal{E}_{\mathrm{top}} + c_{\mathrm{low}} \mathcal{E}_{\mathrm{low}})$ arising from the activation-response constants. Intuitively, the multiplicative Jacobian collects $k \log D$ worth of squared singular-value mass from the randomized quadratic lifting (the $\log D$ factor reflects the Johnson–Lindenstrauss-style embedding stability needed for $D$ outputs). The residual $D^{\top}$ preserves the original directions and therefore does not cancel this new mass.
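The closed form in (8) can be verified against central finite differences. The sketch below assumes $\phi = \tanh$ (so $\phi' = 1 - \tanh^2$), Gaussian weights, and an identity residual; `D_res` names the residual matrix written $D$ in (8), and all function names are illustrative. It also confirms that $\operatorname{tr}(J_g^{\top} J_g)$ equals the sum of squared singular values.

```python
import numpy as np

def glu_row(x, A, C, B, D_res):
    """Per-row block g(x) = B^T (tanh(A^T x) * (C^T x)) + D_res^T x."""
    return B.T @ (np.tanh(A.T @ x) * (C.T @ x)) + D_res.T @ x

def glu_jacobian(x, A, C, B, D_res):
    """Closed-form Jacobian from Eq. (8) with phi = tanh."""
    a = A.T @ x
    dphi = 1.0 - np.tanh(a) ** 2
    gate_path = B.T @ np.diag((C.T @ x) * dphi) @ A.T    # derivative through phi(A^T x)
    value_path = B.T @ np.diag(np.tanh(a)) @ C.T         # derivative through C^T x
    return D_res.T + gate_path + value_path

def finite_difference_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian, one input coordinate at a time."""
    J = np.zeros((f(x).shape[0], x.shape[0]))
    for i in range(x.shape[0]):
        e = np.zeros_like(x)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return J

def jacobian_check(D=8, m=16, seed=0):
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(D)
    A = rng.normal(0.0, scale, size=(D, m))
    C = rng.normal(0.0, scale, size=(D, m))
    B = rng.normal(0.0, scale, size=(m, D))
    D_res = np.eye(D)                                    # identity residual (assumption)
    x = rng.normal(size=D)
    J = glu_jacobian(x, A, C, B, D_res)
    J_fd = finite_difference_jacobian(lambda v: glu_row(v, A, C, B, D_res), x)
    max_err = float(np.abs(J - J_fd).max())
    trace = float(np.trace(J.T @ J))
    sq_sv = float((np.linalg.svd(J, compute_uv=False) ** 2).sum())
    return max_err, trace, sq_sv

if __name__ == "__main__":
    print(jacobian_check())
```

The check covers only the algebraic form of (8); it says nothing about the $k \log D$ scaling of the trace, which depends on the probabilistic argument in the text.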

#### Concentration and completion of the argument.

Hanson–Wright and matrix Bernstein concentration applied to the trace terms show that the empirical per-row average $\tfrac{1}{T} \sum_{t=1}^{T} \operatorname{tr}(J_g(x_t)^{\top} J_g(x_t))$ concentrates around its expectation with high probability. Hence the per-row average trace is $\Omega(k \log D)$ with probability $1 - O(\exp(-ck))$.

Combining the orthogonal Frobenius-energy lower bound with the Jacobian trace lower bound and the observation that the residual $XD$ preserves energy in the original linear subspace, we conclude that the output $Y$ has additional Frobenius mass in orthogonal directions of order $\Theta(m^{-1} k \log D)$, while its top squared singular value remains comparable to that of $XD$. Therefore the effective rank increases by an additive $\delta = \Theta\big(k \log D / \sigma_{\max}^2(XD)\big) > 0$ with high probability, i.e.,

$$\operatorname{erank}(\mathcal{G}(X)) \geq \operatorname{erank}(X) + \delta,$$

with probability at least $1 - \exp(-ck)$, completing the proof.
