SinkRec: Mitigating Semantic State Sink in Long Sequence Recommendation with Memory-Conditioned Gated Delta Networks
Summary
SinkRec introduces a hybrid memory-transition architecture to mitigate semantic state sink in long sequence recommendation, using memory-conditioned gated delta networks to decouple pattern storage from dynamic modeling, achieving linear-time efficiency.
# SinkRec: Mitigating Semantic State Sink in Long Sequence Recommendation with Memory-Conditioned Gated Delta Networks
Source: [https://arxiv.org/html/2606.09888](https://arxiv.org/html/2606.09888)
Zhuang Zhuang¹, Zhipeng Wei¹, Ji Dai², Jie Chen¹, Fei Pan¹, Peng Jiang¹, Kun Gai³. ¹Kuaishou Technology, Beijing, China; ²Beijing University of Posts and Telecommunications, Beijing, China; ³Independent Researcher. {zhuangzhuang,weizhipeng,chenjie20,panfei05,jiangpeng}@kuaishou.com, daiji@bupt.edu.cn, gai.kun@qq.com
###### Abstract
Linear attention provides an efficient backbone for long-sequence recommendation by avoiding the quadratic cost of standard Transformers, but its compressed recurrent state can be dominated by repetitive behavior patterns. We identify this phenomenon as *semantic state sink*, where recurring semantics over-occupy the recurrent state and bias subsequent readouts. To mitigate semantic state sink, we propose SinkRec, a hybrid memory-transition looped architecture that decouples collaborative behavioral pattern storage from dynamic transition modeling. SinkRec externalizes recurring local patterns into a learnable conditional memory through residual vector quantization, reinjects the retrieved codes, and exposes memory key-value pairs to the attention block. It further introduces Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD), which uses memory to purify recurrent writing and reading by suppressing memory-covered updates and removing memory-aligned readout responses. This design turns recurring semantics from state-competing signals into memory-retrievable patterns, allowing the recurrent state to focus on dynamic transitions and alleviating semantic state sink with linear-time efficiency. Experiments on public and industrial datasets demonstrate the effectiveness and efficiency of SinkRec.
## 1 Introduction
Sequential recommendation is fundamental to personalized services, such as streaming media and e-commerce platforms, as it models users’ historical behavior sequences to capture personalized interests and deliver relevant content to users \[[35](https://arxiv.org/html/2606.09888#bib.bib35), [36](https://arxiv.org/html/2606.09888#bib.bib36), [26](https://arxiv.org/html/2606.09888#bib.bib26)\]. As such, sequential modeling has become a fundamental approach for capturing evolving user interests. Early recommendation architectures adopted temporal models such as Markov chains, RNNs, and Transformers, but were mostly applied to short sequences (lengths of $10^{2}$–$10^{3}$). In contrast, full long sequences (length $>10^{3}$) reveal long-term preferences, recurring interests, and delayed dependencies, improving recommendation accuracy and helping mitigate the information cocoon effect. This makes scalable long-sequence modeling a critical step toward more comprehensive and less myopic user preference modeling.
Existing long-sequence recommendation methods often trade off efficiency and completeness. Search-based methods \[[31](https://arxiv.org/html/2606.09888#bib.bib31), [32](https://arxiv.org/html/2606.09888#bib.bib32), [21](https://arxiv.org/html/2606.09888#bib.bib21), [3](https://arxiv.org/html/2606.09888#bib.bib3)\] reduce computation by retrieving partial histories, but introduce two-stage serving complexity and incomplete interest estimation. End-to-end models \[[29](https://arxiv.org/html/2606.09888#bib.bib29)\] retain richer historical signals but incur rapidly increasing computation. This motivates efficient backbones such as linear attention, which scale to long histories while preserving sequence perception.
Although linear attention offers an efficient backbone for long\-sequence recommendation, its recurrent\-state formulation creates a new bottleneck for long\-history modeling\. It compresses the whole history into a finite state matrix, where each behavior writes its key\-value information into the state used for future prediction\. This avoids quadratic attention cost, but also couples recurring semantic storage with dynamic transition modeling\. In long histories, recurring behaviors provide useful preference regularities, whereas sparse transitions reflect changes in current intent\. When both share the same compressed state, repetitive semantics can be repeatedly reinforced and interfere with transition signals needed for the current prediction\. This motivates a central question:
How can recurrent linear attention exploit long user histories without allowing repetitive semantics to dominate the recurrent state?
To answer this question, we analyze how Gated DeltaNet-style recurrent attention carries historical behaviors into the current prediction. Figure [1](https://arxiv.org/html/2606.09888#S1.F1) illustrates a case where early food-related behaviors match the actual next item, while recent repetitive travel-related behaviors are target-irrelevant but receive disproportionately high historical influence scores (detailed in Appendix [C](https://arxiv.org/html/2606.09888#A3)) in vanilla Gated DeltaNet. This indicates that the current prediction is dominated by sink-like travel semantics rather than target-relevant food signals. This exposes a key challenge: the compressed recurrent state is required to serve simultaneously as semantic memory and transition operator, making it vulnerable to semantic state sink when repetitive patterns are over-retained and dominate subsequent readouts.
Figure 1: Motivating example of repetitive semantic state sink. The y-axis measures each past behavior’s contribution to the current prediction through the recurrent state; higher values indicate stronger occupation of the predictive state. Vanilla recurrent states are dominated by repeated irrelevant semantics, while SinkRec suppresses this sink and preserves target-relevant signals.
To address the challenges introduced by the semantic state sink phenomenon, we propose SinkRec, a memory-transition decoupled framework for efficient long-sequence recommendation through a looped hybrid architecture. The key insight is to separate collaborative semantic storage from dynamic transition modeling: recurring local behavior patterns are externalized into conditional memory, while the recurrent state is reserved for memory-unexplained transitions. Specifically, SinkRec consists of two complementary components: (i) the Conditional Memory Module compresses local behavior windows into learnable residual vector-quantized (VQ) codes, reinjects the retrieved codes into the sequence, and exposes memory key–value pairs to the downstream attention block. (ii) Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD) performs time-aware recurrent modeling and uses the memory pairs to purify state writing and reading: memory-covered updates are suppressed before being written, and memory-aligned responses are removed during readout. In this way, SinkRec leverages shared behavioral patterns without repeatedly accumulating them in the recurrent state, mitigating semantic state sink while preserving the efficiency of recurrent linear attention. Moreover, parameter sharing across hybrid architecture blocks keeps SinkRec compact while maintaining strong recommendation performance.
Our contributions are summarized as follows:
- We identify the *semantic state sink* phenomenon in Gated DeltaNet-style long-sequence recommendation, where semantically repetitive patterns can over-occupy the compressed recurrent state and bias subsequent readouts.
- We propose SinkRec, a memory-transition decoupled framework that externalizes recurring local patterns into conditional memory and uses Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD) to purify recurrent writing and reading from memory-covered semantics.
- We conduct extensive experiments on two public datasets and one industrial dataset. The results show that SinkRec consistently outperforms strong baselines with fewer parameters, demonstrating both effectiveness and efficiency for long-sequence recommendation.
## 2 Related Work
Long Sequential Recommendation Architectures. Scaling long-sequence user interaction histories has been increasingly explored as an effective means of improving recommender model performance \[[14](https://arxiv.org/html/2606.09888#bib.bib14)\]. Existing methods commonly rely on attention mechanisms to capture complex user transition patterns and facilitate personalized recommendation, while continuously seeking to exploit the information gain brought by long histories under manageable computational costs. Early studies such as DIN \[[31](https://arxiv.org/html/2606.09888#bib.bib31)\] and SIM \[[21](https://arxiv.org/html/2606.09888#bib.bib21)\] adopt search-based mechanisms to retrieve valuable subsets from historical interaction sequences, inspiring subsequent methods such as VISTA \[[5](https://arxiv.org/html/2606.09888#bib.bib5)\], which caches user histories into a few hundred compact tokens to jointly support prediction. Later, HSTU \[[29](https://arxiv.org/html/2606.09888#bib.bib29)\] reframes recommendation as a sequential transduction task and customizes attention mechanisms for large-scale, non-stationary recommendation data. More recently, LONGER \[[2](https://arxiv.org/html/2606.09888#bib.bib2)\], HiSAC \[[28](https://arxiv.org/html/2606.09888#bib.bib28)\], and GEMs \[[33](https://arxiv.org/html/2606.09888#bib.bib33)\] further exploit Transformer-based architectures to model relevance-driven interactions. However, the quadratic complexity of Transformer-based attention limits its scalability to ultra-long user histories, motivating the emergence of linear-complexity alternatives. Methods such as RankMixer \[[34](https://arxiv.org/html/2606.09888#bib.bib34)\] and UniMixer \[[10](https://arxiv.org/html/2606.09888#bib.bib10)\] enhance the expressive capacity of mixing modules to improve interaction modeling. BlossomRec \[[19](https://arxiv.org/html/2606.09888#bib.bib19)\] employs sparse attention to capture both long- and short-term user interests, while FuXi-Linear \[[27](https://arxiv.org/html/2606.09888#bib.bib27)\] integrates linear attention with temporal features for efficient long-sequence modeling. Nevertheless, these methods inherently lack a knowledge lookup mechanism, forcing them to approximate transition relations purely through continuous computation rather than retrieving collaborative semantic patterns from external or structured memory.
Memory-Augmented Model Scaling. Recent studies \[[1](https://arxiv.org/html/2606.09888#bib.bib1), [6](https://arxiv.org/html/2606.09888#bib.bib6)\] have shown that incorporating memory modules into model backbones can enhance model capacity and improve scaling behavior. In the field of large language models \[[30](https://arxiv.org/html/2606.09888#bib.bib30)\], LongMem \[[23](https://arxiv.org/html/2606.09888#bib.bib23)\] introduces a long-term memory and retrieval mechanism to efficiently leverage ultra-long contexts. UltraMem \[[11](https://arxiv.org/html/2606.09888#bib.bib11)\] and UltraMemV2 \[[12](https://arxiv.org/html/2606.09888#bib.bib12)\] replace sparsely activated experts with efficient memory layers, thereby reducing memory-access overhead. Engram \[[6](https://arxiv.org/html/2606.09888#bib.bib6)\] further proposes conditional memory as a complementary sparse dimension for scaling the capacity of LLMs. In recommender systems, early work such as MIMN \[[20](https://arxiv.org/html/2606.09888#bib.bib20)\] captures evolving user interests for long-sequence modeling. More recently, MSN \[[24](https://arxiv.org/html/2606.09888#bib.bib24)\] and related methods \[[18](https://arxiv.org/html/2606.09888#bib.bib18), [17](https://arxiv.org/html/2606.09888#bib.bib17), [4](https://arxiv.org/html/2606.09888#bib.bib4)\] retrieve personalized representations from large parameterized memories and aggregate them into downstream feature-interaction modules. However, existing recommender models primarily exploit the retrieval capability of memory modules, while largely overlooking their potential to complement and collaborate with sequential modeling modules within a unified architecture.
## 3 Preliminaries
### 3.1 Gated Delta Networks
Linear Transformers improve efficiency over standard Transformers, but their reduced contextual interaction often limits performance on long-context tasks. Gated DeltaNet addresses this issue by extending DeltaNet with adaptive memory-control gates and a delta-update rule. Given the query, key, and value vectors $\mathbf{q}_{t}$, $\mathbf{k}_{t}$, and $\mathbf{v}_{t}$ at step $t$, Gated DeltaNet maintains a key-addressed recurrent state:
$$\mathbf{S}_{t}=\alpha_{t}\mathbf{S}_{t-1}\left(\mathbf{I}-\beta_{t}\mathbf{k}_{t}\mathbf{k}_{t}^{\top}\right)+\beta_{t}\mathbf{v}_{t}\mathbf{k}_{t}^{\top},\qquad\mathbf{o}_{t}=\mathbf{S}_{t}\mathbf{q}_{t},\tag{1}$$
where $\mathbf{S}_{t}\in\mathbb{R}^{d_{v}\times d_{k}}$ is the recurrent state, $\alpha_{t}$ controls state retention, and $\beta_{t}$ controls the delta-update strength. With the cumulative decay $\gamma_{j}=\prod_{i=1}^{j}\alpha_{i}$, the recurrence admits an attention-like form:
$$\mathbf{o}_{t}=\sum_{i=1}^{t}\mathbf{v}_{i}\left(\frac{\gamma_{t}}{\gamma_{i}}\mathbf{k}_{i}^{\top}\mathbf{q}_{t}\right),\qquad\mathbf{O}=\left(\mathbf{Q}\mathbf{K}^{\top}\odot\mathbf{\Gamma}\right)\mathbf{V},\tag{2}$$
where $\mathbf{\Gamma}\in\mathbb{R}^{L\times L}$ is a causal decay mask that assigns decay-aware weights to visible historical positions and masks future positions.
For efficient training, Gated DeltaNet adopts a chunkwise parallel formulation. The transition matrix $\mathbf{I}-\beta_{t}\mathbf{k}_{t}\mathbf{k}_{t}^{\top}$ can be viewed as a generalized Householder transformation, whose cumulative product is represented with a WY-style factorization under a partially expanded recurrence, enabling efficient parallel computation.
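To make the recurrence in Eq. (1) concrete, the following is a minimal NumPy sketch of a single sequential step. It is an illustration only, not the chunkwise parallel kernel described above; the function name `gated_delta_step`, the toy dimensions, and the unit-norm key are assumptions of this sketch.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of Eq. (1): returns the new state S_t and the readout o_t.

    S: (d_v, d_k) state; q, k: (d_k,); v: (d_v,); alpha, beta: scalars in [0, 1].
    """
    d_k = k.shape[0]
    # Decay the old state and erase its content along direction k ...
    S = alpha * S @ (np.eye(d_k) - beta * np.outer(k, k))
    # ... then write the new key-value association.
    S = S + beta * np.outer(v, k)
    return S, S @ q

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                      # toy sizes, chosen for illustration
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)               # unit-norm key (assumption of this sketch)
v = rng.normal(size=d_v)

# With beta = 1 and a unit key, querying with k returns exactly v,
# whatever the previous state and the retention alpha were.
S_new, o = gated_delta_step(S, q=k, k=k, v=v, alpha=0.9, beta=1.0)
```

The final lines show the delta rule acting as a key-addressed overwrite: the term $\mathbf{I}-\beta_{t}\mathbf{k}_{t}\mathbf{k}_{t}^{\top}$ removes whatever the state previously stored along $\mathbf{k}_{t}$ before the new value is written.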
### 3.2 Semantic State Sink
To understand why Gated DeltaNet-style recurrent attention may underutilize long user histories, we analyze the historical influence on the current prediction in Figure [1](https://arxiv.org/html/2606.09888#S1.F1). It shows that early target-relevant behaviors can still provide useful long-range signals, but repeated target-irrelevant semantics in recent history may produce increasingly dominant state responses in vanilla Gated DeltaNet. As a result, the recurrent state is biased toward a sink-like semantic direction, causing the prediction to over-rely on recurring irrelevant patterns while weakening the influence of target-relevant long-range behaviors.
Phenomenon 1 (Semantic State Sink). Semantic state sink refers to the over-retention of semantically repetitive patterns in the recurrent state, where a few semantic directions dominate the state readout and bias subsequent predictions.
This phenomenon reveals both the opportunity and challenge of long\-sequence recommendation\. Long histories contain collaborative behavioral patterns beyond short\-term contexts, but directly accumulating them in the recurrent state couples semantic storage with transition modeling\. As repetitive patterns are repeatedly written into similar state directions, they may form sink\-like semantic directions that dominate subsequent readouts and suppress memory\-unexplained transition signals\. Thus, collaborative patterns should be externalized into memory, while recurrent computation should focus on memory\-unexplained transitions\. SinkRec follows this principle by combining conditional memory with TDGD to leverage shared behavioral patterns and purify recurrent writing and reading from memory\-covered semantics\.
Figure 2: The architecture of SinkRec for long-sequence recommendation. SinkRec consists of $L$ stacked hybrid architecture blocks, each composed of a conditional memory module and a Gated DeltaNet-style linear attention module. The conditional memory module encodes contiguous and dilated behavior windows into residual-quantized memory representations, using the dilated window to enlarge the historical receptive field and sharing the resulting memory key–value pairs with the linear attention module. The linear attention module then performs temporal-aware Gated DeltaNet updates with memory-conditioned writing and state-relation differential readout.
## 4 Method
### 4.1 Hybrid Model Architecture Overview
As shown in Figure [2](https://arxiv.org/html/2606.09888#S3.F2), SinkRec adopts a looped hybrid architecture that combines conditional memory with TDGD to separate collaborative pattern storage from dynamic transition modeling \[[9](https://arxiv.org/html/2606.09888#bib.bib9)\]. Section [4.2](https://arxiv.org/html/2606.09888#S4.SS2) provides the memory side, which preserves recurring local behavior patterns outside the recurrent state. Section [4.3](https://arxiv.org/html/2606.09888#S4.SS3) provides the dynamic computation side, which purifies recurrent writing and reading by suppressing memory-covered updates and removing memory-aligned readout responses. Together, they allow SinkRec to reuse recurring patterns while preventing them from repeatedly dominating the recurrent state. We further provide a theoretical analysis of semantic state sink in Appendix [A](https://arxiv.org/html/2606.09888#A1), showing how repetitive semantic patterns can dominate recurrent state readouts.
### 4.2 Conditional Memory Module
Long-sequence recommendation involves recurring interests that appear as collaborative behavior patterns. Motivated by this, SinkRec reduces redundant state encoding by externalizing these patterns into a learnable conditional memory, where compressed behavior windows are residual-quantized, reinjected, and exposed as memory key–value pairs to TDGD.
#### Behavior Window Encoding.
Given the dense behavioral representation $\mathbf{x}_{1:T}$, SinkRec constructs two local windows for each position $t$: a contiguous window and a dilated window. The contiguous window captures short-range behavioral transitions, while the dilated window enlarges the receptive field over earlier interactions:
$$\mathbf{W}_{t}^{\mathrm{con}}=[\mathbf{x}_{t-W+1},\ldots,\mathbf{x}_{t}],\qquad\mathbf{W}_{t}^{\mathrm{dil}}=[\mathbf{x}_{t-(W-1)s},\ldots,\mathbf{x}_{t-s},\mathbf{x}_{t}],\tag{3}$$
where $W$ is the window size and $s$ is the dilation stride. The contiguous and dilated windows are encoded separately and then summed as the compressed local behavior representation, i.e., $\mathbf{h}_{t}=\mathrm{Enc}_{\mathrm{con}}(\mathbf{W}_{t}^{\mathrm{con}})+\mathrm{Enc}_{\mathrm{dil}}(\mathbf{W}_{t}^{\mathrm{dil}})$, which is used for subsequent memory quantization.
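The index sets behind Eq. (3) can be sketched in a few lines of plain Python. The helper name `window_indices` is hypothetical, and returning `None` for positions before the start of the sequence is an assumption of this sketch, since the text does not say how such positions are padded.

```python
def window_indices(t, W, s):
    """Return (contiguous, dilated) index lists for position t, as in Eq. (3).

    Indices that fall before the start of the sequence are returned as None;
    the padding strategy for them is left open here.
    """
    contiguous = [t - W + 1 + i for i in range(W)]
    dilated = [t - (W - 1 - i) * s for i in range(W)]
    clip = lambda idx: [i if i >= 0 else None for i in idx]
    return clip(contiguous), clip(dilated)

con, dil = window_indices(t=10, W=4, s=3)
print(con)  # [7, 8, 9, 10]
print(dil)  # [1, 4, 7, 10]
```

Both windows hold $W$ items, but the dilated one spans $(W-1)s+1$ positions (10 here, versus 4 for the contiguous window), which is how it enlarges the receptive field at the same encoding cost.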
#### Learnable Residual Vector Quantization.
SinkRec discretizes $\mathbf{h}_{t}$ using a multi-level learnable residual VQ codebook. Starting from $\mathbf{r}_{t}^{0}=\mathbf{h}_{t}$, the $\ell$-th quantization level selects the nearest codeword and updates the residual as:
$$c_{t}^{\ell}=\arg\min_{k}\left\|\mathbf{r}_{t}^{\ell}-\mathbf{e}_{k}^{\ell}\right\|_{2}^{2},\quad\mathbf{q}_{t}^{\ell}=\mathbf{e}_{c_{t}^{\ell}}^{\ell},\quad\mathbf{r}_{t}^{\ell+1}=\mathbf{r}_{t}^{\ell}-\mathbf{q}_{t}^{\ell},\quad\mathbf{z}_{t}^{q}=\sum_{\ell=0}^{L-1}\mathbf{q}_{t}^{\ell},\tag{4}$$
where $\mathbf{e}_{k}^{\ell}$ denotes the $k$-th learnable codeword in the $\ell$-th codebook. Rather than relying on static semantic clusters, the codebooks are optimized end-to-end to learn recurring behavioral patterns.
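The greedy level-by-level selection in Eq. (4) can be sketched with NumPy as follows. This covers only the forward quantization of one vector; codebook training (for example, how gradients pass through the `argmin`) is not shown. The function name, codebook sizes, and random codebooks are illustrative assumptions.

```python
import numpy as np

def residual_quantize(h, codebooks):
    """Greedy residual VQ of one vector h, following Eq. (4).

    codebooks: list of arrays, each of shape (K, d), one per level.
    Returns the selected code indices, the quantized sum z_q, and the final residual.
    """
    residual = h.copy()
    z_q = np.zeros_like(h)
    codes = []
    for E in codebooks:
        # Nearest codeword to the current residual (squared L2 distance).
        c = int(np.argmin(((residual - E) ** 2).sum(axis=1)))
        codes.append(c)
        z_q += E[c]
        residual = residual - E[c]
    return codes, z_q, residual

rng = np.random.default_rng(0)
d, K, levels = 8, 16, 3              # illustrative sizes, not from the paper
codebooks = [rng.normal(scale=0.5 ** l, size=(K, d)) for l in range(levels)]
h = rng.normal(size=d)
codes, z_q, residual = residual_quantize(h, codebooks)
```

By construction `z_q + residual` reproduces `h`, so the final residual is exactly the part of the window representation that the codebooks do not explain.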
#### Memory Output.
The quantized representation $\mathbf{z}_{t}^{q}$ produces two types of outputs. First, it generates a code sequence that is injected back into the main sequence representation:
$$\mathbf{k}_{t}^{q}=\mathbf{W}_{k}^{q}\mathbf{z}_{t}^{q},\qquad\mathbf{v}_{t}^{q}=\mathbf{W}_{v}^{q}\mathbf{z}_{t}^{q},\qquad g_{t}^{q}=\sigma\left(\frac{\mathrm{Norm}(\mathbf{k}_{t}^{q})^{\top}\mathrm{Norm}(\mathbf{x}_{t})}{\sqrt{d}}\right),\tag{5}$$
$$\mathbf{c}_{t}=g_{t}^{q}\mathbf{v}_{t}^{q},\qquad\mathbf{x}_{t}=\mathbf{x}_{t}+\mathbf{c}_{t},\tag{6}$$
where $\mathbf{c}_{t}$ denotes the retrieved code representation injected through a residual connection. Second, the projected quantized representations are exposed to TDGD as memory pairs, $\mathbf{k}_{\mathrm{mem},t}=\mathbf{k}_{t}^{q}$ and $\mathbf{v}_{\mathrm{mem},t}=\mathbf{v}_{t}^{q}$, providing semantic addresses and collaborative behavioral content for memory-conditioned writing and differential readout.
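A small NumPy sketch of the gated reinjection in Eqs. (5) and (6) for one position is given below. The text does not say which normalization $\mathrm{Norm}(\cdot)$ denotes, so L2 normalization is used here as a stand-in; the function names and random projection matrices are likewise hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-6):
    # Stand-in for Norm(.); the actual normalization is not specified in the text.
    return x / (np.linalg.norm(x) + eps)

def inject_memory(x, z_q, W_k, W_v):
    """Eqs. (5)-(6) for a single position: gate the retrieved code and add it to x."""
    k_q = W_k @ z_q
    v_q = W_v @ z_q
    d = x.shape[0]
    score = l2_normalize(k_q) @ l2_normalize(x) / np.sqrt(d)
    g = 1.0 / (1.0 + np.exp(-score))     # scalar gate in (0, 1)
    c = g * v_q
    # k_q and v_q double as the memory pair (k_mem, v_mem) handed to TDGD.
    return x + c, k_q, v_q, g

rng = np.random.default_rng(0)
d = 8
x, z_q = rng.normal(size=d), rng.normal(size=d)
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_new, k_mem, v_mem, g = inject_memory(x, z_q, W_k, W_v)
```

Under the L2 assumption the gate's argument is bounded by $1/\sqrt{d}$ in magnitude, so the gate stays close to 0.5; a learned or unbounded normalization would give it a wider range.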
#### Training Objective.
The memory module is optimized with a residual VQ loss and a reconstruction loss:
$$\mathcal{L}_{\mathrm{mem}}=\sum_{\ell=0}^{L-1}\left(\|\mathrm{sg}[\mathbf{h}_{t}]-\mathbf{q}_{t}^{\ell}\|_{2}^{2}+\beta_{c}\|\mathbf{h}_{t}-\mathrm{sg}[\mathbf{q}_{t}^{\ell}]\|_{2}^{2}\right)+\lambda_{\mathrm{rec}}\left\|\mathrm{Dec}(\mathbf{z}_{t}^{q})-\mathbf{Y}_{t}\right\|_{2}^{2},\tag{7}$$
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, $\beta_{c}$ controls the commitment strength, and $\mathbf{Y}_{t}$ is the subsequent behavior window used as reconstruction supervision for the next-behavior prediction context. The final objective is $\mathcal{L}=\mathcal{L}_{\mathrm{pred}}+\lambda_{\mathrm{vq}}\mathcal{L}_{\mathrm{mem}}$.
### 4.3 Temporal-Aware State-Relation Differential Gated DeltaNet
We identify*semantic state sink*in recurrent linear attention, where recurring semantic interests repeatedly accumulate in the recurrent state and dominate future readouts\. TDGD mitigates this issue by using external semantic memory to regulate both state writing and state reading\.
#### Temporal-aware Gated DeltaNet.
Given the hidden representation $\mathbf{x}_{t}$, TDGD obtains the block-level gate, value, query, and key vectors through a joint SiLU-activated projection:
$$\left[\mathbf{u}_{t},\mathbf{v}_{t},\mathbf{q}_{t},\mathbf{k}_{t}\right]=\operatorname{Split}\left(\operatorname{SiLU}\left(\mathbf{x}_{t}\mathbf{W}_{uvqk}\right)\right),\tag{8}$$
where $\mathbf{q}_{t}$ and $\mathbf{k}_{t}$ address the recurrent state, $\mathbf{v}_{t}$ provides the written content, and $\mathbf{u}_{t}$ serves as a block-level output gate that modulates the recurrent readout before the feed-forward transformation. Following the gated delta rule, vanilla Gated DeltaNet maintains a recurrent key–value state as:
$$\mathbf{S}_{t}=\alpha_{t}\mathbf{S}_{t-1}\left(\mathbf{I}-\beta_{t}\mathbf{k}_{t}\mathbf{k}_{t}^{\top}\right)+\beta_{t}\mathbf{v}_{t}\mathbf{k}_{t}^{\top},\qquad\mathbf{r}_{t}=\mathbf{S}_{t}\mathbf{q}_{t},\tag{9}$$
where $\alpha_{t}$ controls state retention and $\beta_{t}$ controls write strength. The readout is further normalized, projected, and modulated by $\mathbf{u}_{t}$ before being passed to the downstream feed-forward layer.
However, user behaviors in long sequences are highly time-dependent. TDGD therefore injects temporal signals into both query-key addressing and recurrent gating. Following the multi-granularity temporal encoding in FuXi-Linear \[[27](https://arxiv.org/html/2606.09888#bib.bib27)\], we define temporal periods $P_{j}=B^{b_{0}+j}$ and represent each timestamp $\tau$ with periodic phases $\{\sin(2\pi\tau/P_{j}),\cos(2\pi\tau/P_{j})\}_{j=0}^{m-1}$, denoted as $\boldsymbol{\phi}(\tau)$. The key is conditioned on the current interaction time, while the query is conditioned on the target time:
$$\tilde{\mathbf{k}}_{t}=\operatorname{Norm}\left(\mathbf{k}_{t}+\mathbf{W}_{k}^{\tau}\boldsymbol{\phi}(\tau_{t})\right),\qquad\tilde{\mathbf{q}}_{t}=\operatorname{Norm}\left(\mathbf{q}_{t}+\mathbf{W}_{q}^{\tau}\boldsymbol{\phi}(\tau_{t+1})\right).\tag{10}$$
The time interval and periodic feature jointly modulate state retention:
$$\omega_{t}=\underbrace{\exp\left(-\lambda_{\Delta}\log\left(1+\frac{\Delta t_{t}}{\tau_{\Delta}}\right)\right)}_{\text{interval-aware temporal decay}}\cdot\underbrace{\sigma\left(\mathbf{w}_{\phi}^{\top}\boldsymbol{\phi}(\tau_{t})\right)}_{\text{periodic preference}},\tag{11}$$
where $\Delta t_{t}$ denotes the interaction interval, and $\tau_{\Delta}$ and $\lambda_{\Delta}$ control the scale and strength of temporal decay. The final retention and write gates are:
$$\alpha_{t}=\sigma(\mathbf{x}_{t}\mathbf{W}_{\alpha})\,\omega_{t},\qquad\beta_{t}=\sigma\left(\mathbf{x}_{t}\mathbf{W}_{\beta}+\mathbf{W}_{\Delta}\Delta t_{t}\right).\tag{12}$$
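The temporal pieces of Eqs. (10) to (12) can be illustrated with scalar Python. The default values of `B`, `b0`, `m`, `lam`, and `tau_delta`, the use of seconds as the time unit, and the collapse of $\mathbf{x}_{t}\mathbf{W}_{\alpha}$ into one scalar logit are all assumptions of this sketch rather than settings from the paper.

```python
import math

def periodic_phases(tau, B=2.0, b0=0, m=4):
    """phi(tau): sin/cos phases at periods P_j = B**(b0 + j), j = 0..m-1."""
    phases = []
    for j in range(m):
        P = B ** (b0 + j)
        phases += [math.sin(2 * math.pi * tau / P), math.cos(2 * math.pi * tau / P)]
    return phases

def interval_decay(dt, lam=0.5, tau_delta=3600.0):
    """Interval-aware factor of Eq. (11): exp(-lam * log(1 + dt / tau_delta))."""
    return math.exp(-lam * math.log1p(dt / tau_delta))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retention_gate(content_logit, dt, tau, w_phi, **kw):
    """alpha_t of Eq. (12), with x_t W_alpha collapsed to one scalar logit."""
    phi = periodic_phases(tau)
    periodic = sigmoid(sum(w * p for w, p in zip(w_phi, phi)))
    return sigmoid(content_logit) * interval_decay(dt, **kw) * periodic

# One hour after the previous interaction, with lam = 1, the decay factor is 1/2.
half = interval_decay(3600.0, lam=1.0)
```

Since $\exp(-\lambda_{\Delta}\log(1+x))=(1+x)^{-\lambda_{\Delta}}$, the interval term is a power-law decay in the gap $\Delta t_{t}$: it equals 1 for back-to-back interactions and falls off polynomially rather than exponentially for long gaps.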
With these temporal components, TDGD writes the recurrent update in innovation form:
$$\bar{\mathbf{S}}_{t-1}=\alpha_{t}\mathbf{S}_{t-1},\qquad\boldsymbol{\delta}_{t}=\mathbf{v}_{t}-\bar{\mathbf{S}}_{t-1}\tilde{\mathbf{k}}_{t},\qquad\mathbf{S}_{t}=\bar{\mathbf{S}}_{t-1}+\beta_{t}\boldsymbol{\delta}_{t}\tilde{\mathbf{k}}_{t}^{\top}.\tag{13}$$
The raw recurrent readout is then obtained as $\mathbf{r}_{t}=\mathbf{S}_{t}\tilde{\mathbf{q}}_{t}$. This enables TDGD to preserve temporally relevant states while adapting updates to time-aware transitions.
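As a short check, the innovation form in Eq. (13) expands back to the gated delta rule of Eq. (9) with the time-aware key $\tilde{\mathbf{k}}_{t}$ in place of $\mathbf{k}_{t}$. The derivation below uses only the definitions already given:

```latex
\begin{aligned}
\mathbf{S}_{t}
&= \bar{\mathbf{S}}_{t-1} + \beta_{t}\left(\mathbf{v}_{t} - \bar{\mathbf{S}}_{t-1}\tilde{\mathbf{k}}_{t}\right)\tilde{\mathbf{k}}_{t}^{\top} \\
&= \bar{\mathbf{S}}_{t-1} - \beta_{t}\bar{\mathbf{S}}_{t-1}\tilde{\mathbf{k}}_{t}\tilde{\mathbf{k}}_{t}^{\top} + \beta_{t}\mathbf{v}_{t}\tilde{\mathbf{k}}_{t}^{\top} \\
&= \alpha_{t}\mathbf{S}_{t-1}\left(\mathbf{I} - \beta_{t}\tilde{\mathbf{k}}_{t}\tilde{\mathbf{k}}_{t}^{\top}\right) + \beta_{t}\mathbf{v}_{t}\tilde{\mathbf{k}}_{t}^{\top}.
\end{aligned}
```

The innovation $\boldsymbol{\delta}_{t}$ is therefore the gap between the new value and what the decayed state already returns for $\tilde{\mathbf{k}}_{t}$; when the state already predicts $\mathbf{v}_{t}$, the update vanishes.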
#### Memory-conditioned write gate.
The memory module retrieves codebook-based key–value representations $\mathbf{k}_{\mathrm{mem},t}$ and $\mathbf{v}_{\mathrm{mem},t}$. We first map them into the GDN write space through SiLU-activated projections:
$$\mathbf{k}_{t}^{m}=\operatorname{SiLU}\left(\mathbf{k}_{\mathrm{mem},t}\mathbf{W}_{km}\right),\qquad\mathbf{v}_{t}^{m}=\operatorname{SiLU}\left(\mathbf{v}_{\mathrm{mem},t}\mathbf{W}_{vm}\right).\tag{14}$$
To determine whether the current state update is already covered by the retrieved memory, we compare the current write pattern with the memory-induced write pattern:
$$\mathbf{P}_{t}=\mathbf{v}_{t}\tilde{\mathbf{k}}_{t}^{\top},\qquad\mathbf{M}_{t}=\mathbf{v}_{t}^{m}(\mathbf{k}_{t}^{m})^{\top},\tag{15}$$
where $\mathbf{P}_{t}$ denotes the current key–value write, and $\mathbf{M}_{t}$ denotes the memory-side write. We compute their write magnitudes and overlap as:
$$a_{t}^{w}=\|\mathbf{P}_{t}\|_{F}^{2},\qquad b_{t}^{w}=\|\mathbf{M}_{t}\|_{F}^{2},\qquad s_{t}^{w}=(\mathbf{v}_{t}^{\top}\mathbf{v}_{t}^{m})(\tilde{\mathbf{k}}_{t}^{\top}\mathbf{k}_{t}^{m}),\tag{16}$$
where $a_{t}^{w}$ and $b_{t}^{w}$ measure the strengths of the two writes, while $s_{t}^{w}$ measures their key–value overlap. The memory-unexplained write energy is:
$$e_{t}^{w}=a_{t}^{w}+\lambda_{w}^{2}b_{t}^{w}-2\lambda_{w}s_{t}^{w}=\left\|\mathbf{P}_{t}-\lambda_{w}\mathbf{M}_{t}\right\|_{F}^{2},\tag{17}$$
where $\lambda_{w}$ is a learnable scale for calibrating the memory-side write.
However, the gate should depend on both the residual energy and the scale of the current and memory-side writes. Therefore, TDGD converts the overlap, residual energy, and write magnitudes into a memory coverage score:
$$c_{t}^{w}=\sigma\left(\mathbf{W}_{c}^{\top}\Phi_{w}(a_{t}^{w},b_{t}^{w},s_{t}^{w},e_{t}^{w})+b_{w}\right),\tag{18}$$
where $\Phi_{w}(\cdot)$ is a compact coverage feature that summarizes the write overlap, residual energy, and write magnitudes. The write gate is then defined as:
$$g_{t}^{w}=1-\eta_{w}c_{t}^{w},\qquad\bar{\mathbf{v}}_{t}=g_{t}^{w}\mathbf{v}_{t}.\tag{19}$$
Thus, when the current write is largely covered by memory, $c_{t}^{w}$ becomes larger and the effective value $\bar{\mathbf{v}}_{t}$ is suppressed; otherwise, the update is preserved. Finally, TDGD updates the recurrent state using the memory-filtered value and the time-aware key:
$$\bar{\mathbf{S}}_{t-1}=\alpha_{t}\mathbf{S}_{t-1},\qquad\mathbf{S}_{t}=\bar{\mathbf{S}}_{t-1}+\beta_{t}\left(\bar{\mathbf{v}}_{t}-\bar{\mathbf{S}}_{t-1}\tilde{\mathbf{k}}_{t}\right)\tilde{\mathbf{k}}_{t}^{\top}.\tag{20}$$
This write gate reduces redundant semantic accumulation in the recurrent state while retaining memory-unexplained transition signals.
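The write-gate statistics in Eqs. (15) to (19) can be computed from inner products alone, without forming the rank-one matrices $\mathbf{P}_{t}$ and $\mathbf{M}_{t}$. The NumPy sketch below illustrates this for one position. The exact form of $\Phi_{w}$ is not given in the text, so the sketch assumes it simply stacks $[a, b, s, e]$; the weights `w_c`, the bias `b_w`, and the values of `lam_w` and `eta_w` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def write_gate(v, k, v_m, k_m, lam_w, w_c, b_w, eta_w):
    """Memory-conditioned write gate, Eqs. (15)-(19), for one position.

    Phi_w is assumed to be the plain feature vector [a, b, s, e].
    """
    a = (v @ v) * (k @ k)                 # ||v k^T||_F^2
    b = (v_m @ v_m) * (k_m @ k_m)         # ||v_m k_m^T||_F^2
    s = (v @ v_m) * (k @ k_m)             # Frobenius inner product <P, M>
    e = a + lam_w ** 2 * b - 2 * lam_w * s
    coverage = sigmoid(w_c @ np.array([a, b, s, e]) + b_w)
    g = 1.0 - eta_w * coverage
    return g, g * v, e

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
k, k_m = rng.normal(size=d_k), rng.normal(size=d_k)
v, v_m = rng.normal(size=d_v), rng.normal(size=d_v)
lam_w = 0.7
w_c = np.array([0.0, 0.0, 1.0, -1.0])     # reward overlap, penalize residual energy

g, v_bar, e = write_gate(v, k, v_m, k_m, lam_w, w_c, b_w=0.0, eta_w=0.5)

# Cross-check against the explicit matrices of Eq. (17).
P, M = np.outer(v, k), np.outer(v_m, k_m)
e_direct = np.linalg.norm(P - lam_w * M) ** 2

# A write identical to the memory write (lam_w = 1) has zero residual energy.
g_cov, _, e_cov = write_gate(v, k, v, k, 1.0, w_c, b_w=0.0, eta_w=0.5)
```

The closed form costs $O(d_{k}+d_{v})$ per position instead of $O(d_{k}d_{v})$. With $\eta_{w}\in[0,1]$ the gate stays in $[1-\eta_{w},1]$, so a memory-covered write is attenuated rather than discarded.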
#### State\-relation differential readout\.
Although the write gate limits future memory-covered updates, the recurrent state may already contain accumulated semantic components \[[22](https://arxiv.org/html/2606.09888#bib.bib22)\]. We further apply a memory-guided differential readout to suppress these residual memory-aligned responses at query time. Given the normalized query and memory-key directions, denoted by $\bar{\mathbf{q}}_t$ and $\bar{\mathbf{k}}_t^m$, we compute two readouts from the recurrent state:

$$\mathbf{r}_t^q = \mathbf{S}_t\bar{\mathbf{q}}_t, \qquad \mathbf{r}_t^m = \mathbf{S}_t\bar{\mathbf{k}}_t^m, \tag{21}$$

where $\mathbf{r}_t^q$ denotes the main query readout, while $\mathbf{r}_t^m$ probes the recurrent state along the memory-conditioned direction. We then compute the query-memory relation score and the adaptive suppression coefficient as:

$$a_t^r = \operatorname{ReLU}\left((\bar{\mathbf{k}}_t^m)^{\top}\bar{\mathbf{q}}_t\right), \qquad \eta_t^r = \frac{\|\mathbf{r}_t^m\|_2}{\|\mathbf{r}_t^q\|_2 + \|\mathbf{r}_t^m\|_2 + \epsilon}, \tag{22}$$

where $a_t^r$ measures whether the current query accesses the memory-conditioned direction, and $\eta_t^r$ measures the relative strength of the memory-direction response compared with the main query readout. Therefore, the final TDGD readout is formulated as:

$$\tilde{\mathbf{r}}_t = \mathbf{r}_t^q - \eta_t^r a_t^r\mathbf{r}_t^m = \mathbf{S}_t\left(\mathbf{I} - \eta_t^r a_t^r\bar{\mathbf{k}}_t^m(\bar{\mathbf{k}}_t^m)^{\top}\right)\bar{\mathbf{q}}_t. \tag{23}$$

This memory-guided differential readout suppresses responses along repeated memory-aligned semantics, allowing memory-unexplained transition signals to dominate the current prediction. TDGD then applies output normalization and projection to the corrected readout, followed by the block-level gate:

$$\mathbf{o}_t = \mathbf{u}_t \odot \operatorname{Post}\left(\tilde{\mathbf{r}}_t\right), \tag{24}$$

where $\operatorname{Post}(\cdot)$ denotes the output normalization and linear projection of the GDN layer. The gate $\mathbf{u}_t$ further modulates the corrected readout before the lightweight linear layer, preventing noisy or redundant state responses from being uniformly propagated.
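The readout in Eqs. (21) to (23) can be illustrated with a small standard-library sketch. It assumes that "normalized" means L2 normalization and it omits $\operatorname{Post}(\cdot)$ and the gate $\mathbf{u}_t$ from Eq. (24); the function names are hypothetical. The sketch follows the vector form $\mathbf{r}_t^q - \eta_t^r a_t^r\mathbf{r}_t^m$ of Eq. (23); expanding the matrix form literally would carry an additional factor $(\bar{\mathbf{k}}_t^m)^{\top}\bar{\mathbf{q}}_t$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(S, x):
    return [dot(row, x) for row in S]

def l2_normalize(x, eps=1e-12):
    # Assumption: the normalized directions are plain L2-normalized vectors.
    n = math.sqrt(dot(x, x))
    return [xi / (n + eps) for xi in x]

def differential_readout(S, q, k_m, eps=1e-6):
    """Memory-guided differential readout (hypothetical helper)."""
    q_bar = l2_normalize(q)
    k_bar = l2_normalize(k_m)
    # Eq. (21): main readout and memory-direction probe of the same state.
    r_q = matvec(S, q_bar)
    r_m = matvec(S, k_bar)
    # Eq. (22): relation score and adaptive suppression coefficient.
    a = max(0.0, dot(k_bar, q_bar))
    norm_q = math.sqrt(dot(r_q, r_q))
    norm_m = math.sqrt(dot(r_m, r_m))
    eta = norm_m / (norm_q + norm_m + eps)
    # Eq. (23), vector form: subtract the scaled memory-aligned response.
    r_tilde = [rq - eta * a * rm for rq, rm in zip(r_q, r_m)]
    return r_tilde, a, eta
```

Two limiting cases show the intended behavior: when the query is orthogonal to the memory key, the relation score is zero and the readout is unchanged; when the query coincides with the memory key, the two readouts are identical, the coefficient is close to one half, and roughly half of the response is removed.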
#### Prediction and Optimization\.
SinkRec predicts the next item from the output representation $\mathbf{o}_t$ after stacked hybrid blocks: $p(i_{t+1} = j \mid i_{1:t}) = \operatorname{softmax}(\mathbf{o}_t\mathbf{E}^{\top})_j$, where $\mathbf{E}$ is the item embedding matrix. For efficient autoregressive training, given the positive target $i_{t+1}$, we sample $N$ negative items $\mathcal{N}_t$ and optimize the sampled softmax loss \[[27](https://arxiv.org/html/2606.09888#bib.bib27)\]. The final objective is $\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{vq}}\mathcal{L}_{\mathrm{mem}}$.
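As a rough illustration of the training objective, the sketch below computes a sampled softmax loss for one position and combines it with a memory loss. It is a simplified reading, not the paper's exact recipe: the text does not specify the negative sampling distribution or any sampling-bias correction, so none is applied here, and the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_softmax_loss(o_t, pos_emb, neg_embs):
    """Negative log-probability of the positive item among {positive} + negatives."""
    logits = [dot(o_t, pos_emb)] + [dot(o_t, e) for e in neg_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

def total_loss(pred_loss, mem_loss, lambda_vq):
    # L = L_pred + lambda_vq * L_mem
    return pred_loss + lambda_vq * mem_loss
```

When all logits are equal, the loss reduces to $\log(N + 1)$ for $N$ negatives, which is a convenient sanity check for an untrained model.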
### 4.4 Complexity Analysis
SinkRec preserves the linear complexity of Gated DeltaNet with lightweight memory overhead. Let $N$ be the sequence length, $d$ the hidden dimension, $H$ the number of heads, $W$ the window size, $L$ the number of residual VQ levels, and $K$ the codebook size. The Gated DeltaNet backbone costs $O(Nd^2 + Nd^2/H)$. SinkRec adds two main costs: window encoding with $O(NWd)$ for contiguous and dilated windows, and residual VQ lookup with $O(NLKd)$ over $L$ codebooks of size $K$. The memory-guided write and read operations are of the same order as the recurrent state computation, i.e., $O(Nd^2/H)$, and do not change the asymptotic scaling. Thus, the overall complexity is $O(Nd^2 + Nd^2/H + NWd + NLKd)$. Since $W$, $L$, and $K$ are fixed hyperparameters, SinkRec scales linearly with $N$, remaining more efficient than Transformer-based recommenders with $O(N^2 d)$ attention cost.
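To make the scaling argument concrete, the following sketch evaluates the big-O terms above as operation-count proxies with all constants dropped. The hyperparameter values in the usage lines are hypothetical and chosen only for illustration, so the numbers indicate trends rather than measured cost.

```python
def sinkrec_cost(N, d, H, W, L, K):
    """Proxy for O(N d^2 + N d^2 / H + N W d + N L K d), constants dropped."""
    backbone = N * d**2 + N * d**2 / H
    window_encoding = N * W * d
    vq_lookup = N * L * K * d
    return backbone + window_encoding + vq_lookup

def full_attention_cost(N, d):
    """Proxy for the O(N^2 d) attention term of a standard Transformer."""
    return N**2 * d

def break_even_length(d, H, W, L, K):
    # Setting N^2 d equal to the SinkRec proxy and dividing by N d gives
    # N = d + d / H + W + L * K.
    return d + d / H + W + L * K

# Hypothetical configuration, for illustration only.
cfg = dict(d=128, H=4, W=8, L=3, K=256)
n_star = break_even_length(**cfg)
ratio_at_8k = full_attention_cost(8192, cfg["d"]) / sinkrec_cost(8192, **cfg)
```

Under this proxy the cost ratio between full attention and SinkRec is $N / (d + d/H + W + LK)$, so the advantage grows linearly with sequence length once $N$ exceeds the break-even point; with a large codebook, the $NLKd$ lookup term can be the dominant part of the linear cost.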
Table 1: Dataset statistics.
## 5 Experiments
### 5.1 Models and Datasets
Models. To comprehensively evaluate the performance of SinkRec, we consider three representative categories of long-sequence recommendation baselines: full-attention models, linear-attention models, and memory-enhanced models. Specifically, for full-attention methods, we select SASRec \[[13](https://arxiv.org/html/2606.09888#bib.bib13)\], HSTU \[[29](https://arxiv.org/html/2606.09888#bib.bib29)\], Longer \[[2](https://arxiv.org/html/2606.09888#bib.bib2)\], BlossomRec \[[19](https://arxiv.org/html/2606.09888#bib.bib19)\], HyTRec \[[25](https://arxiv.org/html/2606.09888#bib.bib25)\], and CollectiveKV \[[15](https://arxiv.org/html/2606.09888#bib.bib15)\], which model user behavior sequences through standard or enhanced attention mechanisms. For linear-attention methods, we compare with Mamba4Rec \[[16](https://arxiv.org/html/2606.09888#bib.bib16)\], TiM4Rec \[[7](https://arxiv.org/html/2606.09888#bib.bib7)\], and FuXi-Linear \[[27](https://arxiv.org/html/2606.09888#bib.bib27)\], which improve sequence modeling efficiency through recurrent, state-space, or linearized attention structures. For memory-enhanced methods, we include MSN \[[24](https://arxiv.org/html/2606.09888#bib.bib24)\] and DUIA \[[17](https://arxiv.org/html/2606.09888#bib.bib17)\], which use memory modules to store personalized user representations for long-term preference modeling.
Datasets. To evaluate the effectiveness of SinkRec, we conduct experiments on two representative public datasets for long-sequence recommendation: MovieLens-20M and KuaiRec. MovieLens-20M (https://grouplens.org/datasets/movielens/) is a standard movie recommendation benchmark with approximately 20 million ratings and rich tagging information, suitable for evaluating long-term preference modeling. KuaiRec (https://kuairec.com/) \[[8](https://arxiv.org/html/2606.09888#bib.bib8)\] is a public short-video recommendation dataset with diverse feedback signals and dynamic user behaviors, providing a challenging scenario for modeling rapidly evolving user interests. For the industrial dataset, we construct KuaiLLSR from large-scale behavioral logs on the Kuaishou platform, where users interact with local life service short videos. It contains eight days of uniformly sampled user-video interaction records and is used for model training and evaluation.
### 5.2 Comparison with Baseline
To validate the effectiveness of the proposed SinkRec, we report the overall performance of all baselines and SinkRec in Table[2](https://arxiv.org/html/2606.09888#S5.T2)\. From the results, we obtain the following observations\. First, conventional Transformer\-based methods show limited performance in long\-sequence recommendation, indicating that relying solely on contextual interactions is insufficient for effectively modeling long user histories\. In contrast, linear\-attention\-based models demonstrate strong potential, as they provide a more efficient and scalable way to capture long\-range sequential dependencies\. Second, the effectiveness of time\-aware methods such as FuXi\-Linear shows that temporal periodicity and temporal bias are important for sequential prediction\. As contextual biases, temporal signals can enhance the modeling of behavior dependencies in long sequences\. Moreover, target\-time\-aware modeling further contributes to performance improvement by aligning historical behaviors with the prediction time\. Third, SinkRec consistently outperforms all baseline methods, demonstrating the effectiveness of decoupling static semantic retrieval from dynamic state computation\. The results also verify that incorporating time\-aware state modeling improves the ability of recurrent memory\-based models to capture evolving user interests\.
Table 2: Overall performance comparison on public and industrial datasets. The best results are highlighted in bold, and the second-best results are underlined. The performance gains of our method are statistically significant under a paired $t$-test with $p < 0.01$.
### 5.3 Ablation Study
As shown in Table [3](https://arxiv.org/html/2606.09888#S5.T3), we evaluate the contribution of each key component through four variants: (1) w/o TDGD, which replaces TDGD with vanilla Gated DeltaNet; (2) w/o Memory, which removes the conditional memory module and uses only TDGD; (3) w/o Time, which removes temporal decay and periodic temporal features; and (4) w/o Difference, which replaces the differential gate $\alpha$ with a content-only gate.
From the ablation results, we draw three main observations\. First, TDGD improves performance by integrating temporal modeling with differential gating, enabling more effective state writing and reading\. Second, the memory module captures reusable local behavioral patterns and guides TDGD to focus on memory\-unexplained transitions\. Third, temporal features and differential readout are both essential: temporal signals make state updates time\-aware, while differential readout reduces memory\-aligned responses and prevents recurrent states from repeatedly absorbing redundant semantic patterns\.
Table 3: Ablation study of SinkRec on KuaiRec.

Figure 3: Online A/B experimental results.
### 5.4 Online A/B Experimental Results
To evaluate the practical effectiveness of SinkRec, we conducted a 7\-day online A/B test on Kuaishou’s advertising platform with live production traffic\. The control and treatment groups are each allocated 10% of the total traffic for a fair comparison\.
As shown in Figure[3](https://arxiv.org/html/2606.09888#S5.F3), our method achieves consistent improvements over the control group, with average relative gains of 6\.5% in ADVV and 11\.1% in Revenue\. Both improvements are statistically significant under a two\-tailed test \(p < 0\.01\)\. The gains remain positive throughout the experimental period, demonstrating that the proposed method can effectively improve both advertiser\-side value and platform revenue in production environments\. These results validate the practical effectiveness, robustness, and scalability of our framework in large\-scale industrial advertising recommendation systems\.
## 6 Conclusion
We present SinkRec, an efficient long-sequence recommendation framework that addresses the semantic state sink phenomenon in recurrent linear attention. By identifying that recurring semantic interests can be repeatedly absorbed into the recurrent state and interfere with current interest adaptation, we motivate a memory-transition decoupled design that separates collaborative semantic lookup from dynamic transition computation. SinkRec externalizes recurring interests into Conditional Semantic Memory and introduces Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD) to separate sink-inducing collaborative semantics from recurrent state dynamics. This reduces semantic dominance in recurrent states and improves sensitivity to recent interest shifts and target-relevant behaviors. Experiments on public and industrial datasets demonstrate that SinkRec improves recommendation effectiveness while preserving the efficiency of recurrent linear attention.
## References
- Behrouz et al\. \[2026\]Ali Behrouz, Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni\.Memory caching: Rnns with growing memory\.*arXiv preprint arXiv:2602\.24281*, 2026\.
- Chai et al\. \[2025\]Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, et al\.Longer: Scaling up long sequence modeling in industrial recommenders\.In*Proceedings of the Nineteenth ACM Conference on Recommender Systems*, pages 247–256, 2025\.
- Chang et al\. \[2023\]Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, et al\.Twin: Two\-stage interest network for lifelong user behavior modeling in ctr prediction at kuaishou\.In*Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pages 3785–3794, 2023\.
- Chen et al\. \[2026\]Yixiao Chen, Yuan Wang, Yue Liu, Qiyao Wang, Ke Cheng, Xin Xu, Juntong Yan, Shuojin Yang, Menghao Guo, Jun Zhang, et al\.Recurrent preference memory for efficient long\-sequence generative recommendation\.*arXiv preprint arXiv:2602\.11605*, 2026\.
- Chen et al\. \[2025\]Zhimin Chen, Chenyu Zhao, Ka Chun Mo, Yunjiang Jiang, Jane H Lee, Khushhall Chandra Mahajan, Ning Jiang, Kai Ren, Jinhui Li, and Wen\-Yun Yang\.Massive memorization with hundreds of trillions of parameters for sequential transducer generative recommenders\.*arXiv preprint arXiv:2510\.22049*, 2025\.
- Cheng et al\. \[2026\]Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al\.Conditional memory via scalable lookup: A new axis of sparsity for large language models\.*arXiv preprint arXiv:2601\.07372*, 2026\.
- Fan et al\. \[2025\]Hao Fan, Mengyi Zhu, Yanrong Hu, Hailin Feng, Zhijie He, Hongjiu Liu, and Qingyang Liu\.Tim4rec: An efficient sequential recommendation model based on time\-aware structured state space duality model\.*Neurocomputing*, page 131270, 2025\.
- Gao et al\. \[2022\]Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat\-Seng Chua\.Kuairec: A fully\-observed dataset and insights for evaluating recommender systems\.In*Proceedings of the 31st ACM International Conference on Information & Knowledge Management*, pages 540–550, 2022\.
- Gao et al\. \[2026\]Yizhao Gao, Jianyu Wei, Qihao Zhang, Yu Cheng, Shimao Chen, Zhengju Tang, Zihan Jiang, Yifan Song, Hailin Zhang, Liang Zhao, et al\.Hysparse: A hybrid sparse attention architecture with oracle token selection and kv cache sharing\.*arXiv preprint arXiv:2602\.03560*, 2026\.
- Ha et al\. \[2026\]Mingming Ha, Guanchen Wang, Linxun Chen, Xuan Rao, Yuexin Shi, Tianbao Ma, Zhaojie Liu, Yunqian Fan, Zilong Lu, Yanan Niu, et al\.Unimixer: A unified architecture for scaling laws in recommendation systems\.*arXiv preprint arXiv:2604\.00590*, 2026\.
- Huang et al\. \[2024\]Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, and Xun Zhou\.Ultra\-sparse memory network\.*arXiv preprint arXiv:2411\.12364*, 2024\.
- Huang et al\. \[2025\]Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, et al\.Ultramemv2: Memory networks scaling to 120b parameters with superior long\-context learning\.*arXiv preprint arXiv:2508\.18756*, 2025\.
- Kang and McAuley \[2018\]Wang\-Cheng Kang and Julian McAuley\.Self\-attentive sequential recommendation\.In*2018 IEEE international conference on data mining \(ICDM\)*, pages 197–206\. IEEE, 2018\.
- Lai et al\. \[2026\]Weijiang Lai, Beihong Jin, Di Zhang, Siru Chen, Jiongyan Zhang, Yuhang Gou, Jian Dong, and Xingxing Wang\.Unleashing the potential of sparse attention on long\-term behaviors for ctr prediction\.In*Proceedings of the ACM Web Conference 2026*, pages 8041–8050, 2026\.
- Li et al\. \[2026\]Jingyu Li, Zhaocheng Du, Qianhui Zhu, Zhicheng Zhang, Song\-Li Wu, Chaolang Li, Pengwen Dai, et al\.Collectivekv: Decoupling and sharing collaborative information in sequential recommendation\.*arXiv preprint arXiv:2601\.19178*, 2026\.
- Liu et al\. \[2024a\]Chengkai Liu, Jianghao Lin, Jianling Wang, Hanzhou Liu, and James Caverlee\.Mamba4rec: Towards efficient sequential recommendation with selective state space models\.*arXiv preprint arXiv:2403\.03900*, 2024a\.
- Liu et al\. \[2024b\]Peng Liu, Nian Wang, Cong Xu, Ming Zhao, Bin Wang, and Yi Ren\.Dynamic user interest augmentation via stream clustering and memory networks in large\-scale recommender systems\.*arXiv preprint arXiv:2405\.13238*, 2024b\.
- Lu et al\. \[2025\]Hui Lu, Zheng Chai, Yuchao Zheng, Zhe Chen, Deping Xie, Peng Xu, Xun Zhou, and Di Wu\.Large memory network for recommendation\.In*Companion Proceedings of the ACM on Web Conference 2025*, pages 1162–1166, 2025\.
- Ma et al\. \[2026\]Mengyang Ma, Xiaopeng Li, Wanyu Wang, Zhaocheng Du, Jingtong Gao, Pengyue Jia, Yuyang Ye, Yiqi Wang, Yunpeng Weng, Weihong Luo, et al\.Blossomrec: Block\-level fused sparse attention mechanism for sequential recommendations\.In*Proceedings of the ACM Web Conference 2026*, pages 6389–6399, 2026\.
- Pi et al\. \[2019\]Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai\.Practice on long sequential user behavior modeling for click\-through rate prediction\.In*Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining*, pages 2671–2679, 2019\.
- Pi et al\. \[2020\]Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai\.Search\-based user interest modeling with lifelong sequential behavior data for click\-through rate prediction\.In*Proceedings of the 29th ACM International Conference on Information & Knowledge Management*, pages 2685–2692, 2020\.
- Pu et al\. \[2025\]Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, and Xiu Li\.Linear differential vision transformer: Learning visual contrasts via pairwise differentials\.*arXiv preprint arXiv:2511\.00833*, 2025\.
- Wang et al\. \[2023\]Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei\.Augmenting language models with long\-term memory\.*Advances in Neural Information Processing Systems*, 36:74530–74543, 2023\.
- Wu et al\. \[2026\]Shikang Wu, Hui Lu, Jinqiu Jin, Zheng Chai, Shiyong Hong, Junjie Zhang, Shanlei Mu, Kaiyuan Ma, Tianyi Liu, Yuchao Zheng, et al\.Msn: A memory\-based sparse activation scaling framework for large\-scale industrial recommendation\.*arXiv preprint arXiv:2602\.07526*, 2026\.
- Xin et al\. \[2026\]Lei Xin, Yuhao Zheng, Ke Cheng, Changjiang Jiang, Zifan Zhang, and Fanhu Zeng\.Hytrec: A hybrid temporal\-aware attention architecture for long behavior sequential recommendation\.*arXiv preprint arXiv:2602\.18283*, 2026\.
- Yang et al\. \[2026\]Ruochen Yang, Yueyang Liu, Zijie Zhuang, Changxin Lao, Yuhui Zhang, Jiangxia Cao, Jia Xu, Xiang Chen, Haoke Xiao, Xiangyu Wu, et al\.Sarm: Llm\-augmented semantic anchor for end\-to\-end live\-streaming ranking\.*arXiv preprint arXiv:2602\.09401*, 2026\.
- Ye et al\. \[2026\]Yufei Ye, Wei Guo, Hao Wang, Luankang Zhang, Heng Chang, Hong Zhu, Yuyang Ye, Yong Liu, Defu Lian, and Enhong Chen\.Fuxi\-linear: Unleashing the power of linear attention in long\-term time\-aware sequential recommendation\.*arXiv preprint arXiv:2602\.23671*, 2026\.
- Yuan et al\. \[2026\]Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao, Binbin Cao, Jian Wu, and Yuning Jiang\.Hisac: Hierarchical sparse activation compression for ultra\-long sequence modeling in recommenders\.*arXiv preprint arXiv:2602\.21009*, 2026\.
- Zhai et al\. \[2024\]Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al\.Actions speak louder than words: Trillion\-parameter sequential transducers for generative recommendations\.*arXiv preprint arXiv:2402\.17152*, 2024\.
- Zhao et al\. \[2023\]Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al\.A survey of large language models\.*arXiv preprint arXiv:2303\.18223*, 1\(2\):1–124, 2023\.
- Zhou et al\. \[2018\]Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai\.Deep interest network for click\-through rate prediction\.In*Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining*, pages 1059–1068, 2018\.
- Zhou et al\. \[2019\]Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai\.Deep interest evolution network for click\-through rate prediction\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 33, pages 5941–5948, 2019\.
- Zhou et al\. \[2026\]Yu Zhou, Chengcheng Guo, Kuo Cai, Ji Liu, Qiang Luo, Ruiming Tang, Han Li, Kun Gai, and Guorui Zhou\.Gems: Breaking the long\-sequence barrier in generative recommendation with a multi\-stream decoder\.*arXiv preprint arXiv:2602\.13631*, 2026\.
- Zhu et al\. \[2025\]Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, et al\.Rankmixer: Scaling up ranking models in industrial recommenders\.In*Proceedings of the 34th ACM International Conference on Information and Knowledge Management*, pages 6309–6316, 2025\.
- Zhuang et al\. \[2025\]Zhuang Zhuang, Haitao Yuan, Shanshan Feng, Heng Qi, Yanming Shen, and Baocai Yin\.Mgstdn: Multi\-granularity spatial\-temporal diffusion network for next poi recommendation\.In*Proceedings of the 34th ACM International Conference on Information and Knowledge Management*, pages 4560–4570, 2025\.
- Zhuang et al\. \[2026\]Zhuang Zhuang, Shanshan Feng, Hangwei Qian, Mingqi Yang, Heng Qi, Yanming Shen, and Baocai Yin\.Think2go: Generative next poi recommendation with llm reasoning\.In*Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 1*, pages 2112–2123, 2026\.
## Appendix A Theoretical Analysis of Semantic State Sink
We provide a simplified derivation to explain why recurrent linear attention may suffer from semantic state sink in long\-sequence recommendation\. Consider the Gated DeltaNet update:
$$\mathbf{S}_t = \alpha_t\mathbf{S}_{t-1}\left(\mathbf{I} - \beta_t\mathbf{k}_t\mathbf{k}_t^{\top}\right) + \beta_t\mathbf{v}_t\mathbf{k}_t^{\top}, \tag{25}$$

where $\mathbf{S}_t$ is the recurrent state, $\mathbf{k}_t$ and $\mathbf{v}_t$ are the key and value vectors, and $\alpha_t, \beta_t$ control state retention and writing strength. This update writes the value $\mathbf{v}_t$ into the state subspace addressed by $\mathbf{k}_t$. In long user sequences, recurring semantic behavior patterns often appear across distant positions and are expected to produce similar key representations. To analyze how such recurrence affects a key-addressed state subspace, we consider an idealized setting where these keys are aligned with a normalized direction $\mathbf{u}$:

$$\mathbf{k}_t = \mathbf{u}, \qquad \|\mathbf{u}\|_2 = 1. \tag{26}$$

Let $\mathbf{z}_t = \mathbf{S}_t\mathbf{u}$ denote the state content stored in this repeatedly accessed direction. Right-multiplying Eq. ([25](https://arxiv.org/html/2606.09888#A1.E25)) by $\mathbf{u}$ and using $\mathbf{u}^{\top}\mathbf{u} = 1$ yields:

$$\mathbf{z}_t = \alpha_t(1 - \beta_t)\mathbf{z}_{t-1} + \beta_t\mathbf{v}_t. \tag{27}$$

Thus, the repeatedly accessed semantic direction follows a first-order state recurrence. To make the address-level dynamics analytically tractable, we consider a locally stationary setting where the retention and write strengths are time-invariant, i.e., $\alpha_t = \alpha$ and $\beta_t = \beta$. We define the effective retention factor as $\rho = \alpha(1 - \beta)$, with $0 \leq \rho < 1$. Under this approximation, Eq. ([27](https://arxiv.org/html/2606.09888#A1.E27)) reduces to:

$$\mathbf{z}_t = \rho\mathbf{z}_{t-1} + \beta\mathbf{v}_t. \tag{28}$$

This recurrence characterizes the evolution of the content stored in a repeatedly accessed semantic address: previous address-specific content is retained by $\rho$, while the current value is written with strength $\beta$. For semantically recurrent behaviors, the written values usually share a stable semantic component. We therefore decompose each value vector as:

$$\mathbf{v}_t = \bar{\mathbf{v}} + \boldsymbol{\epsilon}_t, \tag{29}$$

where $\bar{\mathbf{v}}$ denotes the recurring semantic component and $\boldsymbol{\epsilon}_t$ denotes instance-specific residual variation. Combining the recurrence with the semantic decomposition of $\mathbf{v}_t$, we can express the address-specific state as:

$$\mathbf{z}_t = \rho^t\mathbf{z}_0 + \underbrace{\frac{\beta(1 - \rho^t)}{1 - \rho}\bar{\mathbf{v}}}_{\text{accumulated recurring semantics}} + \underbrace{\beta\sum_{i=1}^{t}\rho^{t-i}\boldsymbol{\epsilon}_i}_{\text{residual variations}}. \tag{30}$$

This decomposition separates the address-specific state into a decayed initialization term, an accumulated recurring semantic term, and a residual variation term. Since $\rho < 1$, the state does not diverge; instead, as the same semantic address is repeatedly visited, the contribution of $\bar{\mathbf{v}}$ approaches $\frac{\beta}{1 - \rho}\bar{\mathbf{v}}$, causing this address subspace to become biased toward the recurring semantic pattern. Such an address-level bias becomes harmful when the next behavior at time $t+1$ still activates the same semantic address $\mathbf{u}$, while its value carries an additional residual transition component:

$$\mathbf{v}_{t+1} = \bar{\mathbf{v}} + \mathbf{r}_{t+1}, \tag{31}$$

where $\mathbf{r}_{t+1}$ denotes the new transition component. Ignoring the residual variation and using the steady-state approximation $\mathbf{z}_t \approx \frac{\beta}{1 - \rho}\bar{\mathbf{v}}$, the next state content becomes:

$$\begin{aligned}\mathbf{z}_{t+1} &= \rho\mathbf{z}_t + \beta(\bar{\mathbf{v}} + \mathbf{r}_{t+1}) \\ &= \underbrace{\frac{\beta}{1 - \rho}\bar{\mathbf{v}}}_{\text{historical recurring semantics}} + \underbrace{\beta\mathbf{r}_{t+1}}_{\text{new transition signal}}.\end{aligned} \tag{32}$$

Therefore, the new transition signal is mixed with an accumulated historical semantic component. Their relative strength can be approximated by:

$$\mathcal{D}_{t+1} \approx \frac{\left\|\frac{\beta}{1 - \rho}\bar{\mathbf{v}}\right\|_2}{\left\|\beta\mathbf{r}_{t+1}\right\|_2} = \frac{\|\bar{\mathbf{v}}\|_2}{(1 - \rho)\|\mathbf{r}_{t+1}\|_2}. \tag{33}$$

When the effective retention $\rho$ is large, or when the new transition signal $\mathbf{r}_{t+1}$ is weak, the accumulated recurring semantics dominate the address-specific state content.
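The address-level recurrence is easy to check numerically. The sketch below uses a scalar stand-in for the content along $\mathbf{u}$ (an assumption made for brevity; the analysis itself is vector-valued), iterates Eq. (28) with zero residual variation, compares it against the closed form of Eq. (30), and evaluates the dominance ratio of Eq. (33). The values of $\alpha$ and $\beta$ are hypothetical.

```python
def simulate_address(rho, beta, values, z0=0.0):
    """Iterate z_t = rho * z_{t-1} + beta * v_t over a sequence of scalar values."""
    z = z0
    for v in values:
        z = rho * z + beta * v
    return z

def closed_form(rho, beta, v_bar, t, z0=0.0):
    """Eq. (30) with all residual variations set to zero."""
    return rho**t * z0 + beta * (1.0 - rho**t) / (1.0 - rho) * v_bar

def dominance_ratio(rho, v_bar_norm, r_norm):
    """Eq. (33): accumulated recurring semantics relative to the new signal."""
    return v_bar_norm / ((1.0 - rho) * r_norm)

# Hypothetical gate values: strong retention, small write strength.
alpha, beta = 0.99, 0.1
rho = alpha * (1.0 - beta)                 # effective retention factor
z_long = simulate_address(rho, beta, [1.0] * 500)
steady_state = beta / (1.0 - rho)          # limit of the recurring term for v_bar = 1
ratio = dominance_ratio(rho, v_bar_norm=1.0, r_norm=0.2)
```

In this configuration the recurring component is amplified by $1/(1 - \rho)$ relative to a single write, so a transition signal one fifth the size of the recurring component ends up outweighed by a factor of roughly 46 in the stored content.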
To connect this address-level effect back to the Gated DeltaNet output, let $\mathbf{o}_{t+1}$ denote the state readout. For a query $\mathbf{q}_{t+1}$ with a non-negligible projection on the same semantic direction, we decompose the query as:

$$\mathbf{q}_{t+1} = c_{t+1}\mathbf{u} + \mathbf{q}_{t+1}^{\perp}, \qquad c_{t+1} = \mathbf{u}^{\top}\mathbf{q}_{t+1}. \tag{34}$$

The readout can then be decomposed as:

$$\mathbf{o}_{t+1} = \mathbf{S}_{t+1}\mathbf{q}_{t+1} = c_{t+1}\mathbf{z}_{t+1} + \mathbf{S}_{t+1}\mathbf{q}_{t+1}^{\perp}. \tag{35}$$

Thus, when $c_{t+1}$ is non-negligible, the accumulated recurring semantic component in $\mathbf{z}_{t+1}$ is directly propagated to the output $\mathbf{o}_{t+1}$. A large $\mathcal{D}_{t+1}$ therefore indicates that the readout along this semantic direction is dominated by historical recurring semantics, making the new transition signal relatively less salient.
This phenomenon indicates that semantic state sink is not characterized by unbounded growth of the state norm\. Rather, recurrent updates induced by repeated behaviors gradually concentrate the state along frequently visited key\-addressed semantic subspaces\. Consequently, in long\-sequence recommendation, the recurrent state becomes biased toward dominant historical patterns, reducing its sensitivity to recent preference shifts and sparse but predictive transition signals\.
## Appendix B Comparison with Vanilla Gated DeltaNet
Table[4](https://arxiv.org/html/2606.09888#A2.T4)summarizes the architectural differences between vanilla Gated DeltaNet and TDGD\. Vanilla Gated DeltaNet updates a compact recurrent state by writing the current key–value pattern and then reads the state with the current query\. This design is efficient, but its writing and reading operations are both driven only by the current input representation\. Therefore, when similar semantic behaviors repeatedly appear in a long sequence, vanilla Gated DeltaNet may continue to write these recurring patterns into the state and later retrieve them through query\-aligned directions, which can cause semantic state sinks\.
TDGD differs from vanilla Gated DeltaNet in two key aspects. First, on the writing side, TDGD replaces the raw value $\mathbf{v}_t$ with the memory-filtered value $\bar{\mathbf{v}}_t$ and uses the time-aware key $\tilde{\mathbf{k}}_t$. This makes state updates aware of whether the current behavior is already covered by retrieved semantic memory, thereby reducing redundant writes of recurring interests. Second, on the reading side, TDGD augments the standard state readout $\mathbf{S}_t\mathbf{q}_t$ with a state-relation differential term $\eta_t^r a_t^r\mathbf{S}_t\bar{\mathbf{k}}_t^m$. This term suppresses the memory-key-aligned response from the same recurrent state, preventing memory-dominated semantic directions from overwhelming the final readout.
These two modifications target both the formation and the effect of semantic state sinks\. The memory\-conditioned write gate reduces redundant semantic accumulation before it enters the recurrent state, while the state\-relation differential readout suppresses memory\-aligned responses already encoded in the state\. This enables TDGD to preserve memory\-unexplained transition signals, making the model more responsive to recent interest shifts and infrequent but currently relevant behaviors in long sequence recommendation\.
Table 4: Comparison between vanilla Gated DeltaNet and TDGD.
## Appendix C Illustration of Historical Influence Score
To visualize semantic state sink, we compute a historical influence score that measures how strongly each past behavior affects the current prediction through the recurrent state. For model $f \in \{\mathrm{base}, \mathrm{ours}\}$, the state update at position $t$ is:

$$\Delta\mathbf{S}_t^f = \beta_t^f\left(\mathbf{v}_t^f - \bar{\mathbf{S}}_{t-1}^f\mathbf{k}_t^f\right)(\mathbf{k}_t^f)^{\top}, \qquad \bar{\mathbf{S}}_{t-1}^f = \alpha_t^f\mathbf{S}_{t-1}^f. \tag{36}$$

Given the current prediction position $T$, we define:

$$I_t^f = \left\|\Gamma_{t \rightarrow T}^f\,\Delta\mathbf{S}_t^f\,\mathbf{q}_T^f\right\|_2, \tag{37}$$

where $\Gamma_{t \rightarrow T}^f$ denotes the accumulated retention from $t$ to $T$, and $\mathbf{q}_T^f$ is the current readout query. Thus, $I_t^f$ measures the readable influence of the $t$-th historical behavior on the current prediction.

For the case visualization in Figure [1](https://arxiv.org/html/2606.09888#S1.F1), we rescale the scores with a shared maximum over the compared models:

$$\tilde{I}_t^f = \frac{I_t^f}{\max\limits_{g \in \{\mathrm{base}, \mathrm{ours}\},\, i} I_i^g + \epsilon}. \tag{38}$$

This shared scaling preserves the relative magnitude between the vanilla model and SinkRec. Higher values indicate that the corresponding historical behavior occupies a stronger role in the current recurrent readout. Semantic state sink is indicated when semantically repetitive but target-irrelevant behaviors obtain disproportionately high influence scores and dominate the prediction.
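The score in Eqs. (36) to (38) can be sketched as follows for a vanilla delta-rule state. Two simplifications are assumptions of this sketch rather than statements from the text: the accumulated retention $\Gamma_{t \rightarrow T}$ is taken to be the scalar product of the decay gates applied after position $t$ (ignoring how later delta-rule writes may overwrite earlier content), and the prediction position $T$ is the last position of the sequence. Function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def influence_scores(keys, values, alphas, betas, q_T):
    """Return I_t for each position t, with the readout taken at the last position."""
    d_v, d_k = len(values[0]), len(keys[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    raw = []
    for k, v, alpha, beta in zip(keys, values, alphas, betas):
        S_bar = [[alpha * s for s in row] for row in S]
        err = [vi - dot(row, k) for vi, row in zip(v, S_bar)]
        # Eq. (36): dS_t = beta * err k^T, so ||dS_t q_T|| = beta * |k . q_T| * ||err||.
        raw.append(beta * abs(dot(k, q_T)) * math.sqrt(dot(err, err)))
        S = [[s + beta * e * kj for s, kj in zip(row, k)]
             for row, e in zip(S_bar, err)]
    # Eq. (37): scale each write by the decay accumulated after it (assumed scalar).
    scores, gamma = [], 1.0
    for t in reversed(range(len(raw))):
        scores.append(gamma * raw[t])
        gamma *= alphas[t]
    return scores[::-1]

def rescale_shared(scores_by_model, eps=1e-8):
    """Eq. (38): divide every score by one maximum shared across all models."""
    shared_max = max(max(s) for s in scores_by_model.values())
    return {name: [x / (shared_max + eps) for x in s]
            for name, s in scores_by_model.items()}
```

Because the maximum is shared, a model whose writes are uniformly smaller keeps proportionally smaller rescaled scores instead of being stretched to the full range, which is what makes the two heatmaps comparable.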
## Appendix D Experiment Settings, Metrics, and Model Scale
Experiment settings. For data preprocessing, we follow the same settings as HSTU and FuXi-Linear for all compared models to ensure fair comparison. Beyond public datasets, we further validate our framework on a large-scale industrial dataset derived from real-world impression logs of Kuaishou’s advertising platform. Compared with standard public benchmarks, this dataset provides a more realistic evaluation environment characterized by large-scale traffic, diverse user behaviors, and practical commercial recommendation constraints. The results demonstrate the scalability and robustness of our method in real-world industrial deployment scenarios.
Training is conducted on two A800 GPUs, with all settings following FuXi-Linear. Instead of stacking distinct layers, SinkRec iteratively reuses a shared hybrid memory-transition block, resulting in a substantially smaller parameter budget.
Metrics. We adopt three widely used ranking metrics for evaluation: top-$K$ Hit Ratio (HR@$K$, denoted as R@$K$ in tables), Normalized Discounted Cumulative Gain (NDCG@$K$, denoted as N@$K$ in tables), and Mean Reciprocal Rank (MRR). Higher values indicate better recommendation performance for all metrics. Unless otherwise specified, we rank the ground-truth item against the full item set and report results at $K=10$ and $K=50$. For the online A/B experiments, we use Advertiser Value (ADVV) and Revenue as core evaluation metrics.
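For reference, the three offline metrics can be computed per test instance as in the sketch below, assuming a single ground-truth item ranked against the full item set. The function names are hypothetical, and ties are broken optimistically (only strictly higher scores count against the target), which is an assumption rather than something the text specifies.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the ground-truth item among all items (higher score is better)."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)

def hr_at_k(rank, k):
    """Hit Ratio: 1 if the target appears in the top-K list, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """NDCG with one relevant item: ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def mrr(rank):
    """Reciprocal rank of the target, with no cutoff."""
    return 1.0 / rank

# Toy example: four items, the ground-truth item has index 2.
toy_scores = [0.1, 0.9, 0.4, 0.7]
r = rank_of_target(toy_scores, target=2)
print(r, hr_at_k(r, 10), ndcg_at_k(r, 10), mrr(r))
```

Dataset-level numbers are the mean of these per-instance values over all test instances.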
Model scale. For KuaiRec and ML-20M, we set the number of layers to 4 and 8, and the embedding dimension to 128 and 256, respectively. To ensure a fair comparison, we tune the hidden dimensions of all baselines to keep their model scales comparable. The detailed hyperparameter settings and the numbers of non-embedding parameters are reported in Table [5](https://arxiv.org/html/2606.09888#A4.T5). These results also show that SinkRec achieves competitive performance with a more lightweight architecture.
It is worth noting that SinkRec shares parameters across hybrid blocks, which substantially reduces the overall parameter count without increasing inference latency\.
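The effect of parameter sharing on model size can be illustrated with a toy example. The residual `tanh` block below is only a stand-in, not the hybrid memory-transition block, and the sizes are hypothetical. It shows that looping one block `depth` times performs the same number of block applications as a stack of `depth` distinct blocks while storing a single set of weights.

```python
import numpy as np

def make_block(d, hidden, rng):
    """Toy residual block parameters (stand-in for the real hybrid block)."""
    return {"w_in": rng.normal(scale=0.02, size=(d, hidden)),
            "w_out": rng.normal(scale=0.02, size=(hidden, d))}

def apply_block(params, x):
    """One block application: x + tanh(x W_in) W_out."""
    return x + np.tanh(x @ params["w_in"]) @ params["w_out"]

def count_params(blocks):
    """Count parameters, counting a block reused several times only once."""
    unique = {id(b): b for b in blocks}
    return sum(w.size for b in unique.values() for w in b.values())

rng = np.random.default_rng(0)
d, hidden, depth = 128, 256, 4
stacked = [make_block(d, hidden, rng) for _ in range(depth)]  # distinct layers
shared_block = make_block(d, hidden, rng)
looped = [shared_block] * depth                               # one block, reused

x = rng.normal(size=(8, d))
h = x
for block in looped:            # same number of block calls as the stacked model
    h = apply_block(block, h)

print(count_params(stacked), count_params(looped))
```

In this toy setting the looped model holds `1/depth` of the stacked model's block parameters, while the forward pass still runs `depth` block applications.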
Table 5: The sizes of models across various datasets.

Figure 4: TDGD alleviates semantic state sink. (a) In the base model, normalized state-response mass increasingly concentrates on repetitive sink-prone directions as the prefix grows. (b) TDGD keeps the response mass more balanced across repetitive and remaining directions. (c) The base state-probe response map $\|\mathbf{S}_t\mathbf{k}_j^m\|_2$ shows persistent bright bands, indicating that a few repetitive prototypes dominate the recurrent state. (d) TDGD suppresses such directional concentration while preserving structured semantic responses. For (c) and (d), prototype indices are sorted by repetition frequency from high to low, with $j=1$ and $j=96$ corresponding to the most and least frequent prototypes, respectively.
## Appendix E Mechanism Analysis: From Semantic Explanation to State Innovation
#### Visualization of semantic state sink\.
Figure [4](https://arxiv.org/html/2606.09888#A4.F4) analyzes semantic state sink by probing recurrent states with a shared semantic basis. Specifically, we use the learned memory codewords as semantic probe directions and apply the same probes to both the base model and SinkRec. Although the base model does not contain the memory module, its recurrent state can still be projected onto these codeword directions to examine whether it concentrates on repetitive semantics.
Given the recurrent state $\mathbf{S}_t$ and the projected memory-key direction $\mathbf{k}_j^m$ associated with the $j$-th codeword, we define the state-probe response as

$$e_{t,j}=\left\|\mathbf{S}_t\mathbf{k}_j^m\right\|_2. \tag{39}$$

A larger $e_{t,j}$ indicates that the recurrent state is more strongly readable along the semantic direction represented by the $j$-th codeword.

Let $\mathcal{J}_{\mathrm{rep}}$ denote the top-$K$ codeword directions with the highest usage frequency, which represent repetitive behavioral semantics. We measure the response-energy mass concentrated on these repetitive directions as

$$m_t^{\mathrm{rep}}=\frac{\sum_{j\in\mathcal{J}_{\mathrm{rep}}}e_{t,j}^2}{\sum_{j=1}^{J}e_{t,j}^2+\epsilon},\qquad m_t^{\mathrm{rem}}=1-m_t^{\mathrm{rep}}, \tag{40}$$

where $m_t^{\mathrm{rep}}$ quantifies how much state-probe energy is assigned to frequently reused semantic directions.
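Eqs. (39) and (40) amount to a few array operations per prefix. The NumPy sketch below is illustrative only: the function name `repetitive_mass`, the argument layout, and the value of `eps` are assumptions, and the codeword usage counts are taken as given.

```python
import numpy as np

def repetitive_mass(S, probe_keys, usage_counts, top_k, eps=1e-8):
    """Response-energy mass on the top-K most used codeword directions.

    S: (d_v, d_k) recurrent state at one prefix position t
    probe_keys: (J, d_k) projected memory-key directions k_j^m
    usage_counts: (J,) usage frequency of each codeword
    Returns (m_rep, m_rem) as in Eq. (40).
    """
    # Row j of probe_keys @ S.T is (S k_j)^T, so its norm is e_{t,j} (Eq. 39).
    e = np.linalg.norm(probe_keys @ S.T, axis=1)
    energy = e ** 2
    rep_idx = np.argsort(-np.asarray(usage_counts))[:top_k]  # most frequent codewords
    m_rep = energy[rep_idx].sum() / (energy.sum() + eps)
    return m_rep, 1.0 - m_rep

# Toy example: the state responds twice as strongly along the frequent direction.
S_toy = np.diag([2.0, 1.0])
probes = np.eye(2)
m_rep, m_rem = repetitive_mass(S_toy, probes, usage_counts=[10, 1], top_k=1)
print(m_rep, m_rem)
```

Applying the function to the state at each prefix length gives the curves $m_t^{\mathrm{rep}}$ and $m_t^{\mathrm{rem}}$; the unsquared responses `e` across all codewords form one column of the response map.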
Figure [4](https://arxiv.org/html/2606.09888#A4.F4)(a) shows that the base model gradually shifts more response mass toward high-frequency codeword directions as the prefix grows. This indicates that its recurrent state becomes increasingly concentrated on repetitive semantics. Figure [4](https://arxiv.org/html/2606.09888#A4.F4)(b) shows that TDGD reduces this excessive concentration and maintains a more balanced response distribution, suggesting that memory-guided writing and reading alleviate semantic state sink.
Figures [4](https://arxiv.org/html/2606.09888#A4.F4)(c) and (d) further visualize the full state-probe response map $\|\mathbf{S}_t\mathbf{k}_j^m\|_2$, where codeword directions are sorted by usage frequency. In the base model, a few high-frequency directions form persistent bright horizontal bands, especially at later prefixes, showing that repetitive semantics dominate the recurrent state response. In contrast, TDGD yields more distributed yet still structured responses across semantic directions, suppressing excessive dominance by repetitive patterns while preserving meaningful semantic organization.
Figure 5: The effectiveness comparison across different sequence lengths.
## Appendix F Sequence Length Scaling Experiment
To evaluate the scalability of SinkRec with respect to sequence length, we conduct comparative experiments on the KuaiRec dataset. We use FuXi-Linear, the strongest baseline, as the reference model and follow its reported hyperparameter settings for fair comparison. We gradually increase the input sequence length from 256 to 4096. As shown in Figure [5](https://arxiv.org/html/2606.09888#A5.F5), both models benefit from longer histories, but SinkRec consistently outperforms FuXi-Linear across all sequence lengths. Specifically, SinkRec achieves relative MRR improvements of 8.53%, 8.86%, 9.20%, 8.58%, and 9.21% at sequence lengths 256, 512, 1024, 2048, and 4096, respectively. The relative gain therefore remains stable (between 8.53% and 9.21%) and is largest at length 4096, indicating that SinkRec scales effectively to longer histories. This demonstrates that the proposed memory-transition decoupled design can better exploit extended user sequences while maintaining robust architectural scalability.
## Appendix G Computational Cost
To further assess the computational efficiency of the proposed architecture, we report the per-batch inference time of all compared models under identical hardware settings. For a fair comparison of the core sequential modeling component, we remove the external Memory module from our method in this efficiency evaluation and measure the inference latency of the remaining backbone. As shown in Table [6](https://arxiv.org/html/2606.09888#A7.T6), our method achieves the fastest inference speed, requiring only 44.578 ms per batch. This is substantially lower than CollectiveKV (93.578 ms), HSTU (80.491 ms), and BlossomRec (80.368 ms). Notably, our method also outperforms the lightweight linear baseline FuXi-Linear, reducing the inference latency from 61.263 ms to 44.578 ms.
Table 6: Inference time comparison on the same NVIDIA A800 GPUs.

The relatively high cost of CollectiveKV mainly comes from its cross-user collective attention mechanism, while HSTU and BlossomRec introduce additional computation through hierarchical modeling or multi-granularity feature interactions. In contrast, after removing the external Memory module, our backbone still maintains an efficient recurrent/linear sequential modeling structure and avoids expensive quadratic attention or cross-user aggregation during inference. These results demonstrate that the core design of our method is computationally efficient and suitable for large-scale long-sequence recommendation scenarios with strict online latency constraints.
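The per-batch latencies quoted in this appendix can be turned into relative reductions and speedup factors with a few lines of arithmetic. The snippet below uses only the numbers stated in the text; the helper name `relative_reduction` is hypothetical. Note that the 44.578 ms figure is measured for the backbone with the external Memory module removed.

```python
# Per-batch inference latencies in milliseconds, as quoted in the text.
baseline_ms = {
    "CollectiveKV": 93.578,
    "HSTU": 80.491,
    "BlossomRec": 80.368,
    "FuXi-Linear": 61.263,
}
ours_ms = 44.578  # backbone without the external Memory module

def relative_reduction(baseline, new):
    """Fraction of the baseline latency that is saved."""
    return (baseline - new) / baseline

for name, ms in baseline_ms.items():
    saved = 100.0 * relative_reduction(ms, ours_ms)
    print(f"{name}: {saved:.1f}% lower latency, {ms / ours_ms:.2f}x speedup")
```

By this arithmetic the reduction is roughly 27% relative to FuXi-Linear and roughly 52% relative to CollectiveKV.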
## Appendix H Discussions
#### Semantic state sink in recurrent linear attention\.
Recurrent linear attention improves efficiency by compressing long histories into recurrent states. However, the same state must preserve collaborative semantic patterns and compute dynamic transitions. In long user histories, semantically recurrent local patterns can be repeatedly written into the state, causing high-frequency interests to dominate the representation. This semantic state sink weakens the model’s sensitivity to current local behaviors and explains why longer histories do not always yield proportional gains.
#### A unified perspective on memory\-transition decoupling\.
SinkRec views long\-sequence recommendation as a memory\-transition decoupling problem\. Existing methods often couple semantic pattern retrieval and transition modeling within attention weights, retrieved subsequences, memory slots, or recurrent states\. In contrast, SinkRec assigns collaborative semantic patterns to Conditional Semantic Memory and leaves memory\-unexplained transition residuals to the recurrent state\. This separation reduces redundant semantic accumulation and preserves state capacity for dynamic interest adaptation\.
#### Semantic initialization of memory codebooks\.
One future direction is to enrich the memory codebooks with item-side semantic knowledge. While the current residual VQ codebooks are learned end-to-end from behavior windows and mainly capture collaborative recurring patterns, recommendation items often contain category, tag, title, or description information. These textual attributes can be encoded by pretrained LLM embedding models to initialize semantic codebooks. By freezing the initialized codebook vectors during recommendation training, the memory module can preserve a stable external semantic space and avoid being fully collapsed into task-specific ID correlations. The trainable window encoder and projection adapters can then learn to retrieve and use these fixed semantic anchors, while the recurrent state focuses on memory-unexplained transitions. Such frozen semantic codebooks may better capture the latent intent behind contiguous and dilated behavior windows, improving interpretability, cold-start robustness, and generalization to sparse behavioral patterns.