Implicit Reasoning for Large Language Model-based Generative Recommendation
Summary
This paper proposes PauseRec, a lightweight implicit reasoning paradigm for LLM-based generative recommendation that outperforms explicit chain-of-thought methods while significantly reducing training and inference costs.
View Cached Full Text
Cached at: 06/15/26, 08:57 AM
# Implicit Reasoning for Large Language Model-based Generative Recommendation
Source: [https://arxiv.org/html/2606.14142](https://arxiv.org/html/2606.14142)
###### Abstract
Large Language Models \(LLMs\) are increasingly adopted as backbones for Generative Recommendation \(GR\), promising access to pretrained world knowledge\. Yet reliably invoking this knowledge for GR remains poorly understood\. A key obstacle is that LLM\-based GR typically represents items with Semantic IDs \(SIDs\), disrupting LLMs’ natural\-language reasoning interface because these tokens are unseen by the LLM during pretraining\. Existing approaches address this with expensive multi\-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary\. In this work, we systematically decompose explicit reasoning training pipelines for LLM\-based GR, revealing three key limitations: weakened world\-knowledge verbalization, misalignment between SID and natural\-language token embedding spaces, and sensitivity to rationale quality—all of which hurt explicit reasoning performance\. To circumvent these issues, we proposePauseRec, a lightweight implicit reasoning paradigm tailored for GR\.PauseRecis exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: \(1\) it outperforms standard explicit CoT methods by up to 6\.22%, \(2\) it reduces training cost by up to65%GPU hours, and \(3\) it speeds up inference by up to71\.3%\. These results positionPauseRecas a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM\-based GR111Work done when Yinhan He was a Research Intern at Snap Inc\.\.\.
Implicit Reasoning for Large Language Model\-based Generative Recommendation
## 1Introduction
Large Language Models \(LLMs\) have recently been adopted as backbones for Generative Recommendation \(GR\), enabling LLM\-based GR systems that formulate recommendation as conditional generation: an LLM reads a user history and generates the next item\(Huaet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib11); Baoet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib12); Rajputet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib10)\)\. The appeal of LLMs for GR lies in their pretrained world knowledgeZhaoet al\.\([2023](https://arxiv.org/html/2606.14142#bib.bib37)\); Huang and Chang \([2023](https://arxiv.org/html/2606.14142#bib.bib30)\); Yuet al\.\([2024](https://arxiv.org/html/2606.14142#bib.bib31)\)\. In principle, this knowledge can help infer semantic relationships among historical items, identify a user’s latent intent, and map that intent to plausible next items beyond memorized co\-occurrences\(Wanget al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib14); Zhanget al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib15)\)\. Yet the process of efficiently and effectively accessing LLMs’ pretrained knowledge for GR remains poorly understoodZhanget al\.\([2026](https://arxiv.org/html/2606.14142#bib.bib36)\)\.
A key obstacle to leveraging an LLM’s world knowledge for GR centers on item representation\. Specifically, LLM\-based GR systems typically represent items with Semantic IDs \(SIDs\), i\.e\., short sequences of special tokens derived from items’ semantic relations\(Rajputet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib42)\)\. SIDs make item generation tractable given their compactness, but they are not natural\-language expressions and reside outside the pretrained LLM vocabulary\(Liet al\.,[2021](https://arxiv.org/html/2606.14142#bib.bib17)\)\. This creates a mismatch: LLMs access world knowledge through natural language, while the recommendation task is to generate a non\-linguistic SID conditioned on other non\-linguistic SIDs\. We therefore ask:How can pretrained LLM world knowledge be effectively leveraged to improve recommendation over SID tokens?
Following the broader LLM literatureYuet al\.\([2024](https://arxiv.org/html/2606.14142#bib.bib31)\); Petroniet al\.\([2019](https://arxiv.org/html/2606.14142#bib.bib34)\), one natural answer to this question is leveraging explicit Chain\-of\-Thought \(CoT\) reasoning\(Weiet al\.,[2022](https://arxiv.org/html/2606.14142#bib.bib29); Kojimaet al\.,[2022](https://arxiv.org/html/2606.14142#bib.bib18)\)222In this work, we use both “reasoning” and “rationales” to refer to LLMs’ Chain\-of\-Thought \(CoT\) process, i\.e\., the intermediate, step\-by\-step traces that LLMs generate before producing a final answerWeiet al\.\([2022](https://arxiv.org/html/2606.14142#bib.bib29)\)\.\. Explicit CoT has been shown to improve LLM performance on a range of knowledge\-intensive domains, including mathematicsImaniet al\.\([2023](https://arxiv.org/html/2606.14142#bib.bib35)\), scienceTruhnet al\.\([2023](https://arxiv.org/html/2606.14142#bib.bib50)\), and codingJianget al\.\([2026](https://arxiv.org/html/2606.14142#bib.bib44)\)\. For LLM\-based GR, previous methods have pursued a similar goal through multi\-step training pipelines\. These pipelines typically ground LLMs in SIDs via continual pretraining \(CPT\) on natural\-language item descriptions, optimize next\-item prediction with supervised finetuning \(SFT\), elicit explicit rationales through SFT over reasoning trajectories \(which we refer to as CoT SFT\) , and refine model responses with reinforcement learning \(RL\) post\-training\(Liuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib40); Yuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib38); Lianget al\.,[2026](https://arxiv.org/html/2606.14142#bib.bib39)\)\. Yet existing work provides limited insight into when these stages are necessary and why they help\. Given each stage’s high computation cost, understanding these questions is critical to both justifying the full workflow and identifying more efficient alternatives\.
Figure 1:The three identified limitations for explicit CoT in SID\-based GR\. CoT SFT weakens world\-knowledge verbalization \(left\), separates natural\-language and SID embedding spaces \(middle\), and makes recommendation quality sensitive to rationale perturbations \(right\), motivating an implicit alternative to verbal rationales\.To address this gap, we first analyze explicit reasoning pipelines for LLM\-based GR, examining each stage’s contribution and necessity\. We begin with CPT stage, finding that CPT\-trained models can recover coarse item categories but often struggle to identify titles or fine\-grained categories, indicating that grounding provides a real but incomplete semantic signal\. We then test whether CoT SFT with various reasoning formats, including template\-based reasoning and teacher\-generated reasoning, can improve recommendation performance\. Across these variants, CoT SFT consistentlyunderperformssimple next\-item SFT\. Performance gains from explicit CoT emerge only after expensive RL post\-training\.
To explain this discrepancy, we identify three limitations of explicit reasoning\. First, we find that CoT SFT makes pretrained world knowledge harder to verbalize under standard decoding, even though this knowledge remains recoverable from the model’s logits\. Second, we show that text and SID token embeddings become geometrically separated during training\. Our theoretical analysis proves that this separation limits the extent to which reasoning expressed in natural\-language tokens can shape the final SID prediction\. Third, we demonstrate that recommendation performance is sensitive to superficial perturbations of the ground\-truth rationales\. Together, these findings suggest that explicit rationales are a brittle interface for exploiting LLM knowledge in LLM\-based GR\.
To circumvent the aforementioned challenges, we proposePauseRec, a lightweightimplicitreasoning framework for SID\-based GR\. Instead of crafting ground\-truth natural\-language rationales with expensive teacher models and training the model to generate those rationales,PauseRecinserts a short sequence of trainable<pause\>tokens before SID generation\. The<pause\>token is initialized and pretrained to connect language and SID representations, then optimized only through the final next\-item prediction objective, giving the model latent computation steps that directly shape SID prediction\.PauseRecaddresses the three issues of explicit reasoning pipelines by \(i\) removing reliance on verbalizing pretrained knowledge, \(ii\) bridging the text\-SID representation gap through a trainable<pause\>token, and \(iii\) avoiding brittle rationale supervision\. On multiple Amazon review datasets,PauseRecoutperforms SFT and CoT\-based methods by up to 6\.22%, while substantially simplifying explicit reasoning pipelines; it reduces training cost by up to65%GPU hours and speeds up inference by71\.3%, positioning implicit reasoning as a stronger and more efficient alternative for LLM\-based GR\. Our contributions are as follows:
- •Diagnostic analysis\.We decompose explicit reasoning pipelines for LLM\-based GR and identify why they fail without RL post\-training, including incomplete SID grounding, weakened world\-knowledge verbalization, text–SID embeddings mismatch, and sensitivity to rationale formats\.
- •Implicit reasoning framework\.We introduce a novel pipeline termedPauseRec, which uses trainable<pause\>tokens to elicit latent reasoning without rationale supervision\.
- •Empirical evaluation\.Across three Amazon review datasets,PauseRecimproves over standard SFT and CoT\-based methods by up to 6\.22% while reducing training and inference overhead\.
## 2Preliminaries
### 2\.1Problem Formulation
Following the GR literatureLiuet al\.\([2025](https://arxiv.org/html/2606.14142#bib.bib40)\), we consider the sequential recommendation task\. Letℐ\\mathcal\{I\}denote the set of all items\. Given a user’snnchronologically ordered interaction historyH=\[i1,i2,…,in\]H=\[i\_\{1\},i\_\{2\},\\ldots,i\_\{n\}\]whereij∈ℐi\_\{j\}\\in\\mathcal\{I\}, the task is to predict the next itemin\+1i\_\{n\+1\}that the user will interact with\. Following recent work\(Rajputet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib10); Baoet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib12)\), LLM\-based GR represents each itemi∈ℐi\\in\\mathcal\{I\}with a Semantic ID \(SID\), i\.e\., a sequence of tokenssi=\[si\(1\),si\(2\),…,si\(L\)\]s\_\{i\}=\[s\_\{i\}^\{\(1\)\},s\_\{i\}^\{\(2\)\},\\ldots,s\_\{i\}^\{\(L\)\}\]of lengthLLthat are added to the LLM’s vocabulary\. Recommendation can then be framed as conditional generation:
p\(in\+1\|H\)=p\(sin\+1\|Prompt\(H\)\)p\(i\_\{n\+1\}\|H\)=p\(s\_\{i\_\{n\+1\}\}\|\\text\{Prompt\}\(H\)\)\(1\)wherePrompt\(H\)\\text\{Prompt\}\(H\)converts the interaction history into a natural\-language prompt listing past items \(and optionally metadata\)\. All methods in this paper share this generative formulation; they differ in how reasoning is inserted before SID prediction\.
### 2\.2Existing Explicit CoT Pipelines for GR
We introduce the multiple training stages of existing explicit reasoning pipelinesLiuet al\.\([2025](https://arxiv.org/html/2606.14142#bib.bib40)\); Lianget al\.\([2026](https://arxiv.org/html/2606.14142#bib.bib39)\)for GR as follows:
Continual Pretraining \(CPT\)\.The LLM is finetuned on an interleaved corpus of SIDs and item descriptions, with only SID token embeddings trainable\. This stage grounds item semantics into SID token embeddings\. Given itemiiwith descriptiondid\_\{i\}, the model is trained on:
ℒCPT=−𝔼\(si,di\)\[logp\(si\|di\)\+logp\(di\|si\)\]\\mathcal\{L\}\_\{\\text\{CPT\}\}=\-\\mathbb\{E\}\_\{\(s\_\{i\},d\_\{i\}\)\}\\left\[\\log p\(s\_\{i\}\|d\_\{i\}\)\+\\log p\(d\_\{i\}\|s\_\{i\}\)\\right\]\(2\)
Next\-item Supervised finetuning \(SFT\)\.The CPT model is finetuned on user\-items interaction histories to predict the next item by generating its SID:
ℒSFT=−𝔼\(H,in\+1\)\[logp\(sin\+1\|Prompt\(H\)\)\]\\mathcal\{L\}\_\{\\text\{SFT\}\}=\-\\mathbb\{E\}\_\{\(H,i\_\{n\+1\}\)\}\\left\[\\log p\(s\_\{i\_\{n\+1\}\}\|\\text\{Prompt\}\(H\)\)\\right\]\(3\)
CoT SFT\.After SFT, the model is finetuned to generate natural\-language rationales before the target SID\. The training objective pairs each historyHH, rationalerr, and next itemin\+1i\_\{n\+1\}as
ℒReasoning=\\displaystyle\\mathcal\{L\}\_\{\\text\{Reasoning\}\}=\(4\)−𝔼\(H,r,in\+1\)\[logp\(r,sin\+1\|Prompt\(H\)\)\]\\displaystyle\-\\mathbb\{E\}\_\{\(H,r,i\_\{n\+1\}\)\}\\left\[\\log p\(r,s\_\{i\_\{n\+1\}\}\|\\text\{Prompt\}\(H\)\)\\right\]Here, rationales are method\-specific: someLiuet al\.\([2025](https://arxiv.org/html/2606.14142#bib.bib40)\)use reasoning templates, while othersLianget al\.\([2026](https://arxiv.org/html/2606.14142#bib.bib39)\)utilize a teacher LLM\.
Reinforcement Learning \(RL\) Post\-training\.Existing methods further apply RL to optimize recommendation rewards directly\(Liuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib40); Yuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib38); Lianget al\.,[2026](https://arxiv.org/html/2606.14142#bib.bib39)\), though this stage is computationally expensive\.
## 3Contributions of the Training Stages
Given the current gap in understanding when and why different training stages make CoT effective for GR, we analyze the role of each stage in Section[2\.2](https://arxiv.org/html/2606.14142#S2.SS2)\. We focus on CPT and CoT SFT here; next\-item SFT and RL are evaluated in Section[6](https://arxiv.org/html/2606.14142#S6)\.
### 3\.1CPT: Can LLMs Recover SID Semantics?
The primary aim of CPT is to ground SID semantics in LLMs, based on the premise that LLMs can reason over SIDs only after understanding their semantics\. Before examining reasoning\-related stages, we ask how much item\-level semantic information an LLM recovers from SIDs after CPT\.
Experimental Design\.We train a Qwen3\-1\.7BTeam \([2025](https://arxiv.org/html/2606.14142#bib.bib3)\)backbone on Amazon BeautyNiet al\.\([2019](https://arxiv.org/html/2606.14142#bib.bib25)\)with CPT for 2 epochs, where each SID is paired with its name and category during training\. After CPT, we test whether the model can generate \(1\)item titlesand \(2\)item categories333In Amazon BeautyNiet al\.\([2019](https://arxiv.org/html/2606.14142#bib.bib25)\), categories are three\-level paths, e\.g\., “Beauty ¿ Hair Care ¿ Conditioners\.”at 1\-, 2\-, and 3\-level granularity\. We prompt with each test SID and measure exact\-match accuracy; prompts and decoding are in Appendix[F\.3](https://arxiv.org/html/2606.14142#A6.SS3)\.
Results and Analysis\.
Table 1:SID metadata recovery after CPT\. The model recovers coarse one\-level categories almost perfectly, but fails on item titles and fine\-grained categories, showing that SID grounding provides partial semantic information rather than precise item\-level understanding\.From Table[1](https://arxiv.org/html/2606.14142#S3.T1), we observe that \(1\)Fine\-grained understanding is poor:title recovery stays near 0% and full\-category accuracy remains below 7\.2% on all datasets, so item\-level semantics are largely unrecovered\. \(2\)Coarse category signal is strong:1\-level category accuracy reaches highest 99\.6% and 2\-level accuracy up to 40\.5%, indicating that CPT captures broad categorical structure\. These results show that LLMs associate SIDs with semantics from pretraining, but only at a coarse level\. We next test whether CoT can convert this signal into better SID prediction\.
### 3\.2CoT SFT: The Failure of Explicit CoT
Here, we investigate if CoT SFT improves GR\.
Experimental Design\.We perform CoT SFT on Qwen3\-1\.7BTeam \([2025](https://arxiv.org/html/2606.14142#bib.bib3)\)after CPT and SFT on Amazon BeautyNiet al\.\([2019](https://arxiv.org/html/2606.14142#bib.bib25)\), using template\-based, teacher\-generated, rejection\-sampled, and format\-restricted rationales\. For template\-based rationales, we use: \(1\)Template\-Category: “The user is likely to buy items in the \{target item category\} category\.” and \(2\)Template\-Extended: “The user demonstrates interest in \{frequent categories\} products\. By identifying the user’s preference in \{characteristics\}, we can predict the user’s purchase of \{target categories\}, for example, a \{item title\}\.” For teacher\-generated rationales, we use Gemini 3\.1 Flash\-Lite and ProTeamet al\.\([2023](https://arxiv.org/html/2606.14142#bib.bib45)\)to producefree\-formtraces\. For rejection sampling, Gemini 3\.1 Flash\-Lite generates multiple traces per sample, and we select either the trace with the highest target\-SID logits \(Gemini 3\.1 Flash\-Lite Rejection\) or the trace Gemini 3\.1 Pro judges to best connect the user history to the target item \(Gemini 3\.1 FL Gemini Rejection\)\. Finally,format\-restrictedtraces impose reasoning constraints, e\.g\., rationales must reference SIDs\. Sample rationales and prompts are in Appendix[F](https://arxiv.org/html/2606.14142#A6)\.
Results and Analysis\.
Table 2:CoT SFT variants on Amazon Beauty\. Explicit rationales fail to outperform simple next\-item SFT across rationale variants, indicating that rationale supervision alone does not reliably improve GR\.Table[2](https://arxiv.org/html/2606.14142#S3.T2)shows that explicit CoT variants underperform next\-item SFT\. The strongest variant, Gemini rejection sampling, remains below the baseline \(0\.0524 vs\. 0\.0533 Hit@5\), while weaker teacher\-generated variants lose over 20% relative Hit@5, so CoT SFT alone does not reliably improve SID prediction\. This pattern contrasts with CoT’s success on language tasks: prior LLM\-based GR work that reports gains from in\-text reasoning relies on expensive RL with verifiable rewards \(RLVR\) after CoT SFT\(Liuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib40); Yuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib38)\)\. While RLVR can recover performance, it requires multiple rollout trajectories per step and is substantially more expensive than next\-item SFT, which raises the question:why does CoT SFT fail for SID\-based GR?
## 4Diagnosis of CoT SFT Limitations
To understand why explicit CoT SFT fails, we conduct diagnostic studies and identify three limitations of the CoT SFT stage\.
### 4\.1Difficulty Verbalizing World Knowledge
Finding\.CoT SFT does not erase LLMs’ world knowledge, but makes them difficult to verbalize\.
Experimental Design\.We evaluate Qwen3\-1\.7BTeam \([2025](https://arxiv.org/html/2606.14142#bib.bib3)\)after CoT SFT on representative language tasks benchmarks MMLU\(Hendryckset al\.,[2020](https://arxiv.org/html/2606.14142#bib.bib26)\), HellaSwag\(Zellerset al\.,[2019](https://arxiv.org/html/2606.14142#bib.bib27)\), PIQA\(Bisket al\.,[2020](https://arxiv.org/html/2606.14142#bib.bib28)\), and ARC\-Challenge\(Clarket al\.,[2018](https://arxiv.org/html/2606.14142#bib.bib43)\)in multiple\-choice format\. We reporttext\-matchaccuracy \(exact A/B/C/D generation\) andlogit\-basedaccuracy \(whether the correct choice has the highest logit\)\.
Table 3:General\-language reasoning accuracy after recommendation CoT SFT\. Text\-match accuracy collapses while logit\-based accuracy remains close to the base model, suggesting that answer information is still present in logits but is no longer reliably verbalized\.Results and Analysis\.Table[3](https://arxiv.org/html/2606.14142#S4.T3)shows text\-match accuracy degrades after CoT SFT, while logit\-based accuracy remains close to the base model on all benchmarks\. It shows that the LLM’s world knowledge remains primarily in logit space and is hard to verbalize in explicit natural language text format\. We next examine whether this text–SID interface mismatch is also in the token embedding space\.
### 4\.2Text–SID Embedding Misalignment
Finding\.SID and natural\-language tokens become geometrically separated in the token embedding space, causing difficulty for LLM to unify text and SIDs under a coherent rationale \(see results and analysis for specific reasons\)\.
Experimental Design\.We visualize token embeddings after SID initialization, CPT, SFT, and CoT SFT using PCAJolliffe \([2025](https://arxiv.org/html/2606.14142#bib.bib46)\), comparing ordinary text tokens with SID tokens\.
Figure 2:PCA of text and SID token embeddings across training stages\. SID tokens drift away from ordinary text tokens as training progresses, indicating limited embedding space overlap between text and SIDs\.Results and Analysis\.Figure[2](https://arxiv.org/html/2606.14142#S4.F2)shows text and SID embeddings diverging across stages, and the gap is already pronounced after CPT and continue to slightly expand during SFT and CoT SFT\. This token embedding discrepancy suggests difficulty in unifying language and SIDs in one coherent representation\. Specifically, Appendix[D](https://arxiv.org/html/2606.14142#A4)proves that when text\- and SID\-induced hidden\-state directions are weakly coupled, updates driven by natural\-language rationales can only weakly shift the logits over SID tokens, so explicit CoT has limited leverage on the final recommendation\.
### 4\.3Performance Fragility w\.r\.t\. Rationales
Finding\.After CoT SFT, recommendation performance is highly sensitive to the rationale text at inference, even when generated reasoning only slightly deviates from the ground\-truth rationale\.
Experimental Design\.We test CoT SFT models with ground\-truth rationales and controlled perturbations—removing the target item category, randomly dropping five words, or randomly adding five noise words—and measure Hit@5 and NDCG@5 under each setting\.
Table 4:Rationale perturbation sensitivity on Amazon BeautyNiet al\.\([2019](https://arxiv.org/html/2606.14142#bib.bib25)\)\. Removing the target category more than halves performance, and even small word\-level perturbations affect accuracy, showing that explicit CoT relies on brittle rationale cues\.Results and Analysis\.Table[4](https://arxiv.org/html/2606.14142#S4.T4)shows that performance is highly sensitive to rationale content\. Removing the target category more than halves Hit@5 \(0\.1165 to 0\.0540\) and NDCG@5 \(0\.0836 to 0\.0376\)\. Surface perturbations also matter: dropping five words reduces Hit@5 by 18\.5%, while adding five noise words reduces NDCG@5 by 18\.4%\. Explicit CoT thus depends on brittle rationale cues, especially whether the text preserves semantics needed for the target SID\.
### 4\.4Summary of Findings
Our diagnostics expose three CoT failures:weakened verbalizationleaves answer signals in logits but weakens decoding;text–SID embedding misalignmentlimits rationale effects on SID logits; andfragile rationalesmake metrics sensitive to small edits\. This motivatesimplicit reasoning in latent space: learned<pause\>tokens bridge language to SIDs without decoding brittle intermediate natural language reasoning text\.
## 5Methodology:PauseRec
Figure 3:Overview ofPauseRec\. Instead of generating explicit rationales and applying RL post\-training,PauseRecpretrains a<pause\>token to bridge text and SID representations, then inserts pause tokens before SID generation and trains them only through the final next\-item prediction loss\.Motivated by Section[4](https://arxiv.org/html/2606.14142#S4), we proposePauseRec, an implicit reasoning method for LLM\-based GR that keeps the CPT and next\-item SFT stages of explicit pipelines but replaces CoT SFT and RL with pause\-based latent computation\.
### 5\.1Overview ofPauseRec
As illustrated in Fig\.[3](https://arxiv.org/html/2606.14142#S5.F3), we perform CPT \(Section[2\.2](https://arxiv.org/html/2606.14142#S2.SS2)\), then conduct next\-item SFT and<pause\>pretraining in parallel on the same CPT checkpoint\. The SFT branch follows Section[2\.2](https://arxiv.org/html/2606.14142#S2.SS2); the pause branch finetunes on the CPT corpus with<pause\>tokens injected at random text positions so the token learns semantic transitions between language and SID tokens\. We then load the pretrained<pause\>embedding into the SFT checkpoint and run implicit reasoning SFT on next\-item data withkkpauses inserted between user history and target SID, optimizing only SID positions\.
PauseRecaddresses the three CoT failures above: \(1\)Computation without verbalizationvia latent<pause\>steps that need not be decoded as natural language; \(2\)Bridging embedding spacesvia CPT\-grounded pause pretraining \(Appendix[E](https://arxiv.org/html/2606.14142#A5)visualizes the trained<pause\>token positioned between embedding spaces\); and \(3\)Avoiding rationale supervisionby masking loss on pause positions and optimizing only target SID\.
### 5\.2<pause\>Token Initialization
We add<pause\>to the vocabulary and initialize its embedding at the mean of all token embeddings after CPT, with variance set to10−910^\{\-9\}times the embedding variance \(equivalent to a near\-deterministic start at the vocabulary center\):𝐞<pause\>\(0\)=1\|𝒱\|∑v∈𝒱𝐞v\\mathbf\{e\}\_\{\\texttt\{<pause\>\}\}^\{\(0\)\}=\\frac\{1\}\{\|\\mathcal\{V\}\|\}\\sum\_\{v\\in\\mathcal\{V\}\}\\mathbf\{e\}\_\{v\}, where𝒱\\mathcal\{V\}is the full vocabulary\. This center initialization gives<pause\>a neutral starting point between text and SIDs\.
### 5\.3Two\-Stage Training
Stage 1:<pause\>Token Pretraining\.Starting from the CPT checkpoint, we finetune on the CPT corpus with<pause\>inserted at random positions covering 10% of each sequence \(Fig\.[3](https://arxiv.org/html/2606.14142#S5.F3)\)\. Only𝐞<pause\>\\mathbf\{e\}\_\{\\texttt\{<pause\>\}\}is trainable; all other parameters remain frozen\. This concentrates updates on the bridge token while preserving grounded SID embeddings and the pretrained language backbone\.
Stage 2: Implicit Reasoning SFT\.We load the pretrained𝐞<pause\>\\mathbf\{e\}\_\{\\texttt\{<pause\>\}\}into the SFT checkpoint and appendkkpauses between user history and target:
x′=Prompt\(H\)∥<pause\>,…,<pause\>⏟ktimesx^\{\\prime\}=\\text\{Prompt\}\(H\)\\\|\\underbrace\{\\texttt\{<pause\>\},\\ldots,\\texttt\{<pause\>\}\}\_\{k\\text\{ times\}\}\(5\)The LLM is finetuned with loss masked at<pause\>positions; only target SID tokens are optimized:
ℒimplicit=−∑l=1Llogpθ\(sn\+1\(l\)∣x′,sn\+1\(1:l−1\)\)\\vskip\-3\.61371pt\\mathcal\{L\}\_\{\\text\{implicit\}\}=\-\\sum\_\{l=1\}^\{L\}\\log p\_\{\\theta\}\\left\(s\_\{n\+1\}^\{\(l\)\}\\mid x^\{\\prime\},s\_\{n\+1\}^\{\(1:l\-1\)\}\\right\)\\vskip\-0\.72229pt\(6\)By not imposing loss on pause positions, we avoid imitating a fixed teacher rationale distribution and instead let the model use pauses only when they improve SID prediction\. In practice, pause slots act as task\-specific latent scratch space between the textual history and discrete SID outputs\. See Appendix[F\.1](https://arxiv.org/html/2606.14142#A6.SS1)for sample prompts and formatted training text for each stage ofPauseRec\.
### 5\.4Inference
Table 5:Main recommendation results on three Amazon datasets\.PauseRecconsistently improves over next\-item SFT and exceeds the baselines on most metrics; boldface and underlining mark the best and second\-best results\.At test time, we use the same prompt template as implicit\-reasoning SFT \(Appendix[F\.1](https://arxiv.org/html/2606.14142#A6.SS1)\), insertkkliteral<pause\>tokens between the<think\>and</think\>tags before the SID output, and autoregressively decode the next SID\. No rationale text is generated at inference, which removes the token overhead of explicit CoT while preserving a dedicated computation window before SID prediction\.
## 6Experiments
### 6\.1Experimental Setup
Datasets\.We evaluate on three Amazon review datasets\(Niet al\.,[2019](https://arxiv.org/html/2606.14142#bib.bib25)\): Beauty, Sports and Outdoors, and Toys and Games\. Following\(Liuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib40)\), we filter users and items with fewer than five interactions and use a leave\-last\-out split: the final item is held out for testing, the second\-to\-last for validation, and the third\-to\-last as the training target with all earlier interactions as input\.
Baselines\.We compare \(1\)traditional sequential recommenders: GRU4Rec\(Hidasiet al\.,[2016](https://arxiv.org/html/2606.14142#bib.bib1)\), SASRec\(Kang and McAuley,[2018](https://arxiv.org/html/2606.14142#bib.bib2)\), BERT4Rec\(Sunet al\.,[2019](https://arxiv.org/html/2606.14142#bib.bib4)\), and HGN\(Maet al\.,[2019](https://arxiv.org/html/2606.14142#bib.bib49)\); \(2\)generative retrieval models: HSTU\(Zhaiet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib47)\)and TIGER\(Rajputet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib42)\); \(3\)LLM\-based models: next\-item SFT \(our reproduction\) and OneRec\-Think\(Liuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib40)\)\(explicit CoT with RLVR\); and \(4\)implicit reasoning: ReaRec\(Linet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib24)\)\.
Metrics and Implementation\.We report Hit@5, Hit@10, NDCG@5, and NDCG@10\(Rajputet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib42)\)\. The backbone is Qwen3\-1\.7B\(Team,[2025](https://arxiv.org/html/2606.14142#bib.bib3)\)\. CPT runs for 3 epochs \(lr10−410^\{\-4\}\), pause pretraining for 2 epochs \(lr10−310^\{\-3\}\), and implicit SFT for 5 epochs \(lr5×10−55\\times 10^\{\-5\}\) with AdamW \(wd 0\.01\)\. Main results usek=5k\{=\}5pauses\. Training and evaluation prompts match the templates in Appendix[F\.1](https://arxiv.org/html/2606.14142#A6.SS1)\.
### 6\.2Effectiveness & Efficiency
Table 6:Efficiency comparison on Qwen3\-1\.7B with Amazon Beauty\. By avoiding RL post\-training and natural\-language rationale generation,PauseRecuses about 65% fewer training GPU hours and is roughly 3\.5×\\timesfaster per inference sample than OneRec\-Think\.We evaluate the effectiveness and efficiency ofPauseRec\. From Table[5](https://arxiv.org/html/2606.14142#S5.T5), we observe that \(1\)Consistent gains over next\-item SFT:PauseRecimproves every metric over the next\-item SFT baseline, with relative gains up to 8\.85% on Toys Hit@5\. \(2\)Competitive with or better than RL\-based CoT:PauseRecoutperforms OneRec\-Think on 10 of 12 metrics, including all Sports and Toys metrics, with up to 6\.22% relative improvement on Toys Hit@5; OneRec\-Think remains higher on Beauty Hit@10 and NDCG@10\. \(3\)Substantial gains over non\-LLM recommenders:PauseRecconsistently outperforms all baselines, highlighting the value of LLM knowledge through a SID\-compatible interface\.
From Table[6](https://arxiv.org/html/2606.14142#S6.T6),PauseRecreduces training GPU hours by 65% and inference latency by roughly 3\.5×\\timeson Beauty by avoiding RL post\-training and rationale generation\. OneRec\-Think uses relatively short template rationales here; inference savings grow with generated tokens \(Appendix[G\.1](https://arxiv.org/html/2606.14142#A7.SS1)\)\.
### 6\.3Ablation Studies
We compare our CPT\-grounded pause pretraining with alternative initializations before implicit SFT: the mean of text embeddings only, the mean of SID embeddings only, and the default special\-token initialization\.
Table 7:<pause\>initialization ablation on Beauty\. Pretraining the pause token to bridge text and SID contexts gives the best Hit@5 and NDCG@5, outperforming text\-only, SID\-only, and default initializations\.Table[7](https://arxiv.org/html/2606.14142#S6.T7)shows that the pretrained<pause\>token performs best, with modest but consistent gains over text\-only, SID\-only, and default initializations\. This supports pause pretraining for bridging text and SID embedding spaces\.
### 6\.4Parameter Analysis
We analyze the effect of pause countkk\. All settings share the same pause pretraining; we appendkkpauses during implicit SFT and use the samekkat inference\.
Figure 4:Effect of the number of<pause\>tokens\. Moderate latent computation works best \(k=5k\{=\}5\), while further increasingkkprovides no performance improvement\.Figure[4](https://arxiv.org/html/2606.14142#S6.F4)shows moderate values ofkkwork best \(k=5k\{=\}5\); increasing tok=10k\{=\}10does not consistently help \(Table[10](https://arxiv.org/html/2606.14142#A7.T10)\), suggesting useful latent computation saturates after some pause steps\.
### 6\.5Qualitative Analysis
To understand howPauseRecuses latent pause computation during inference, we analyze where each<pause\>token in the reasoning block attends in the surrounding context\. For each pause token, we average its outgoing attention to context tokens across all layers and heads, and visualize how that distribution changes relative to the preceding pause token \(Fig\.[5](https://arxiv.org/html/2606.14142#S6.F5)\)\.
Figure 5:Attention changes across<pause\>positions for a representative recommendation\. Early pause tokens attend broadly to the prompt and history boundary, while later pause tokens focus on a smaller set of historical SID tokens; red and blue indicate increased and decreased attention after each pause\.From Fig\.[5](https://arxiv.org/html/2606.14142#S6.F5)\(see full prompt in Appendix[F\.2](https://arxiv.org/html/2606.14142#A6.SS2)\), we observe a multi\-stage process\. \(1\)Context orientation:early pauses attend broadly to the instruction and history boundary, establishing that the next SID should be inferred from purchase history\. \(2\)Preference aggregation:middle and later pauses shift toward historical SIDs, identifying purchases relevant to user intent\. With pause position proceeds, the LLM focuses on a small salient subset of SIDs, locating items similar to target item\. This staged transition explains why latent pause computation improves GR\.
## 7Related Work
LLM\-based GR\.Recent work uses LLMs as rankers or feature extractors\(Houet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib13)\)and as generative recommenders that output SIDs\(Rajputet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib10); Huaet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib11)\)\. These pipelines typically use CPT on item\-text corpora to ground SIDs\(Baoet al\.,[2023](https://arxiv.org/html/2606.14142#bib.bib12)\), then apply next\-item SFT\. Our work builds on this foundation and asks when pretrained world knowledge improves SID prediction beyond standard training\.
Reasoning in LLMs\.Chain\-of\-Thought prompting improves reasoning on language tasks such as math\(Weiet al\.,[2022](https://arxiv.org/html/2606.14142#bib.bib29)\)and science\(Lewkowyczet al\.,[2022](https://arxiv.org/html/2606.14142#bib.bib19)\), but SID\-based GR involves non\-linguistic outputs\. Recent GR systems add CoT SFT and RL on top of CPT\(Liuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib40); Yuet al\.,[2025](https://arxiv.org/html/2606.14142#bib.bib38); Lianget al\.,[2026](https://arxiv.org/html/2606.14142#bib.bib39)\); our stage\-wise analysis clarifies when those additions help\. Implicit reasoning via latent tokens appears in quiet CoT\(Zelikmanet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib23)\)and ReaRec\(Linet al\.,[2024](https://arxiv.org/html/2606.14142#bib.bib24)\); to our knowledge, we provide the first systematic explicit\-vs\-implicit comparison for LLM\-based GR with an analysis of CoT SFT failure\.
## 8Conclusion
This paper shows that explicit rationales are a poor interface for SID\-based generative recommendation: LLMs retain useful signals, but weakened verbalization, text–SID embedding mismatch, and rationale sensitivity limit CoT SFT\.PauseRecreplaces rationales with trainable<pause\>tokens, enabling latent reasoning that bridges language and SIDs\. Extensive experiments show thatPauseRecis effective and efficient\.
## References
- K\. Bao, J\. Zhang, Y\. Zhang, W\. Wang, F\. Feng, and X\. He \(2023\)TALLRec: an effective and efficient tuning framework to align large language model with recommendation\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.14142#S2.SS1.p1.8),[§7](https://arxiv.org/html/2606.14142#S7.p1.1)\.
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi,et al\.\(2020\)PIQA: reasoning about physical commonsense in natural language\.AAAI\.Cited by:[§4\.1](https://arxiv.org/html/2606.14142#S4.SS1.p2.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? Try ARC, the AI2 reasoning challenge\.ArXiv\.Cited by:[§4\.1](https://arxiv.org/html/2606.14142#S4.SS1.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.ArXiv\.Cited by:[§4\.1](https://arxiv.org/html/2606.14142#S4.SS1.p2.1)\.
- B\. Hidasi, A\. Karatzoglou, L\. Baltrunas, and D\. Tikk \(2016\)Session\-based recommendations with recurrent neural networks\.InICLR,Cited by:[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1)\.
- Y\. Hou, J\. Zhang, Z\. Lin, H\. Lu, R\. Xie, J\. McAuley, and W\. X\. Zhao \(2023\)Large language models are zero\-shot rankers for recommender systems\.InICML,Cited by:[§7](https://arxiv.org/html/2606.14142#S7.p1.1)\.
- W\. Hua, Y\. Xu, Y\. Ge, Y\. Zhang, S\. Xu, J\. Tan, and Y\. Dong \(2023\)UP5: unbiased foundation model for fairness\-aware recommendation\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1),[§7](https://arxiv.org/html/2606.14142#S7.p1.1)\.
- J\. Huang and K\. C\. Chang \(2023\)Towards reasoning in large language models: a survey\.InFindings of ACL,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1)\.
- S\. Imani, L\. Du, and H\. Shrivastava \(2023\)Mathprompter: mathematical reasoning using large language models\.InACL,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1)\.
- J\. Jiang, F\. Wang, J\. Shen, S\. Kim, and S\. Kim \(2026\)A survey on large language models for code generation\.ACM Trans\. Softw\. Eng\. Methodol\.\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1)\.
- I\. Jolliffe \(2025\)Principal component analysis\.InInt\. Encycl\. Stat\. Sci\.,Cited by:[§4\.2](https://arxiv.org/html/2606.14142#S4.SS2.p2.1)\.
- W\. Kang and J\. McAuley \(2018\)Self\-attentive sequential recommendation\.InICDM,Cited by:[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\)Large language models are zero\-shot reasoners\.NeurIPS\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1)\.
- A\. Lewkowycz, A\. Andreassen, D\. Dohan, E\. Dyer, H\. Michalewski, V\. Ramasesh, A\. Slone, C\. Anil, I\. Schlag, T\. Gutman\-Solo,et al\.\(2022\)Solving quantitative reasoning problems with language models\.InNeurIPS,Cited by:[§7](https://arxiv.org/html/2606.14142#S7.p2.1)\.
- L\. Li, Y\. Zhang, and L\. Chen \(2021\)Personalized transformer for explainable recommendation\.InACL\-IJCNLP,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p2.1)\.
- M\. Liang, Y\. Li, J\. Xu, K\. Asadi, X\. Liu, S\. Gu, K\. Rangadurai, F\. Shyu, S\. Wang, S\. Yang,et al\.\(2026\)Generative reasoning re\-ranker\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p4.4),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p5.1),[§7](https://arxiv.org/html/2606.14142#S7.p2.1)\.
- X\. Lin, F\. Zhang, M\. Wang, W\. Zhou, X\. Huang, K\. He, S\. Chen, and L\. Liu \(2024\)ReaRec: reasoning for sequential recommendation\.InWWW,Cited by:[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1),[§7](https://arxiv.org/html/2606.14142#S7.p2.1)\.
- Z\. Liu, S\. Wang, X\. Wang, R\. Zhang, J\. Deng, H\. Bao, J\. Zhang, W\. Li, P\. Zheng, X\. Wu,et al\.\(2025\)Onerec\-think: in\-text reasoning for generative recommendation\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14142#S2.SS1.p1.8),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p4.4),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p5.1),[§3\.2](https://arxiv.org/html/2606.14142#S3.SS2.p4.1),[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p1.1),[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1),[§7](https://arxiv.org/html/2606.14142#S7.p2.1)\.
- C\. Ma, P\. Kang, and X\. Liu \(2019\)Hierarchical gating networks for sequential recommendation\.InKDD,Cited by:[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1)\.
- J\. Ni, J\. Li, and J\. McAuley \(2019\)Justifying recommendations using distantly\-labeled reviews and fine\-grained aspects\.InEMNLP\-IJCNLP,Cited by:[§3\.1](https://arxiv.org/html/2606.14142#S3.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.14142#S3.SS2.p2.1),[Table 4](https://arxiv.org/html/2606.14142#S4.T4),[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p1.1),[footnote 3](https://arxiv.org/html/2606.14142#footnote3)\.
- F\. Petroni, T\. Rocktäschel, S\. Riedel, P\. Lewis, A\. Bakhtin, Y\. Wu, and A\. Miller \(2019\)Language models as knowledge bases?\.InEMNLP\-IJCNLP,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1)\.
- S\. Rajput, N\. Mehta, A\. Singh, R\. Hulikal Keshavan, T\. Vu, L\. Heldt, L\. Hong, Y\. Tay, V\. Tran, J\. Samost,et al\.\(2023\)Recommender systems with generative retrieval\.NeurIPS\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1),[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p3.4)\.
- S\. Rajput, N\. Mehta, A\. Singh, R\. H\. Keshavan, T\. Vu, L\. Heldt, L\. Hong, Y\. Tay, V\. Q\. Tran, J\. Samost,et al\.\(2024\)Recommender systems with generative retrieval\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.14142#S2.SS1.p1.8),[§7](https://arxiv.org/html/2606.14142#S7.p1.1)\.
- F\. Sun, J\. Liu, J\. Wu, C\. Pei, X\. Lin, W\. Ou, and P\. Jiang \(2019\)BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer\.InCIKM,Cited by:[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1)\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.ArXiv\.Cited by:[§3\.2](https://arxiv.org/html/2606.14142#S3.SS2.p2.1)\.
- Q\. Team \(2025\)Qwen3 technical report\.External Links:2505\.09388Cited by:[§3\.1](https://arxiv.org/html/2606.14142#S3.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.14142#S3.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.14142#S4.SS1.p2.1),[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p3.4)\.
- D\. Truhn, J\. S\. Reis\-Filho, and J\. N\. Kather \(2023\)Large language models should be used as scientific reasoning engines, not knowledge databases\.Nat\. Med\.\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1)\.
- X\. Wang, J\. Cui, F\. Fukumoto, and Y\. Suzuki \(2025\)AGRec: adapting autoregressive decoders with graph reasoning for LLM\-based sequential recommendation\.InFindings of ACL,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1)\.
- wangshy31 \(2025\)OneRec\-Think\.Note:[https://github\.com/wangshy31/OneRec\-Think](https://github.com/wangshy31/OneRec-Think)GitHub repositoryCited by:[Appendix C](https://arxiv.org/html/2606.14142#A3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.NeurIPS\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1),[§7](https://arxiv.org/html/2606.14142#S7.p2.1),[footnote 2](https://arxiv.org/html/2606.14142#footnote2)\.
- J\. Yu, X\. Wang, S\. Tu, S\. Cao, D\. Zhang\-Li, X\. Lv, H\. Peng, Z\. Yao, X\. Zhang, H\. Li,et al\.\(2024\)Kola: carefully benchmarking world knowledge of large language models\.InICLR,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1),[§1](https://arxiv.org/html/2606.14142#S1.p3.1)\.
- Q\. Yu, K\. Fu, S\. Zhang, Z\. Lv, F\. Wu, and F\. Wu \(2025\)ThinkRec: thinking\-based recommendation via LLM\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14142#S2.SS2.p5.1),[§3\.2](https://arxiv.org/html/2606.14142#S3.SS2.p4.1),[§7](https://arxiv.org/html/2606.14142#S7.p2.1)\.
- E\. Zelikman, G\. Harik, Y\. Shao, V\. Jayasiri, N\. Haber, and N\. D\. Goodman \(2024\)Quiet\-STAR: language models can teach themselves to think before speaking\.ArXiv\.Cited by:[§7](https://arxiv.org/html/2606.14142#S7.p2.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InACL,Cited by:[§4\.1](https://arxiv.org/html/2606.14142#S4.SS1.p2.1)\.
- J\. Zhai, L\. Liao, X\. Liu, Y\. Wang, R\. Li, X\. Cao, L\. Gao, Z\. Gong, F\. Gu, J\. He,et al\.\(2024\)Actions speak louder than words: trillion\-parameter sequential transducers for generative recommendations\.InICML,Cited by:[§6\.1](https://arxiv.org/html/2606.14142#S6.SS1.p2.1)\.
- H\. Zhang, T\. Zhang, J\. Yin, O\. Gal, A\. Shrivastava, and V\. Braverman \(2025\)CoVE: compressed vocabulary expansion makes better LLM\-based recommender systems\.InFindings of ACL,Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1)\.
- L\. Zhang, Y\. Huang, H\. Lv, M\. Yin, L\. Li, Z\. Chen, H\. Wang, and E\. Chen \(2026\)Why thinking hurts? diagnosing and rectifying the reasoning shift in foundation recommender models\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1)\.
- W\. X\. Zhao, K\. Zhou, J\. Li, T\. Tang, X\. Wang, Y\. Hou, Y\. Min, B\. Zhang, J\. Zhang, Z\. Dong,et al\.\(2023\)A survey of large language models\.ArXiv\.Cited by:[§1](https://arxiv.org/html/2606.14142#S1.p1.1)\.
## Appendix ALimitations and Potential Risks
PauseRecleaves several natural extensions for future work\. First, our study uses a compact pause\-token design and reports sensitivity to pause length, but does not exhaustively tune all possible pause placements, initialization schedules, or decoding variants\. Second, our evaluation follows the standard offline next\-item prediction protocol; complementary user\-facing studies could further examine how latent reasoning affects perceived usefulness, diversity, and recommendation presentation\. Finally, because implicit<pause\>tokens are not natural\-language rationales, their intermediate computation is less directly readable by users, motivating additional probing and visualization tools for analyzing how pause tokens support SID prediction\. Like other recommender systems,PauseRecmay amplify popularity bias, reinforce historical user preferences too strongly, or inherit biases present in the interaction data and pretrained LLM\. Deployments should include appropriate mitigations\.
## Appendix BAI Usage
AI writing assistance was used only to polish grammar, clarity, and sentence flow\. All research ideas, experimental designs, analyses, results, and final claims were developed, checked, and approved by the authors\.
## Appendix CArtifacts
Our artifact release includes the code and scripts for constructing SID\-based training data, running the fourPauseRectraining stages, evaluating constrained SID decoding, and generating the tables and figures reported in the paper\. The implementation is based on the open\-source OneRec\-Think repository\(wangshy31,[2025](https://arxiv.org/html/2606.14142#bib.bib41)\); we extend it with pause\-token pretraining, implicit\-reasoning finetuning, evaluation utilities, and experiment orchestration forPauseRec\. OneRec\-Think is released under the Apache License 2\.0, which permits reuse and modification\. Our code is released under the MIT License\.
## Appendix DTheoretical Analysis of Text–SID Separation
We formalize why geometric separation between natural\-language and SID representations can weaken explicit CoT\. Consider the final step after a rationale, where the model must generate a SID token\. Letvsv\_\{s\}denote the output embedding of SID tokenss, and let the SID logit be
zs\(h\)=vs⊤h,z\_\{s\}\(h\)=v\_\{s\}^\{\\top\}h,wherehhis the hidden state before SID generation\. Let𝒰text\\mathcal\{U\}\_\{\\text\{text\}\}be the subspace in which hidden states move when the model generates or is optimized on natural\-language rationale tokens, and let
𝒰SID=span\{vy−vs:y,s∈𝒮\}\\mathcal\{U\}\_\{\\text\{SID\}\}=\\mathrm\{span\}\\\{v\_\{y\}\-v\_\{s\}:y,s\\in\\mathcal\{S\}\\\}be the subspace that controls relative SID logits\. Define the text–SID coupling coefficient
ρ=‖P𝒰SIDP𝒰text‖2,\\rho=\\\|P\_\{\\mathcal\{U\}\_\{\\text\{SID\}\}\}P\_\{\\mathcal\{U\}\_\{\\text\{text\}\}\}\\\|\_\{2\},whereP𝒰P\_\{\\mathcal\{U\}\}is the orthogonal projection onto subspace𝒰\\mathcal\{U\}\. Smallerρ\\rhomeans stronger separation between text\-induced hidden\-state movement and SID\-discriminative directions\.
###### Theorem 1\(Text–SID separation bounds the effect of verbal rationales\)\.
Suppose adding a natural\-language rationale changes the hidden state before SID generation fromhhtoh\+Δh\+\\Delta, where
Δ=Δtext\+r,Δtext∈𝒰text,‖r‖≤ϵ\.\\Delta=\\Delta\_\{\\text\{text\}\}\+r,\\quad\\Delta\_\{\\text\{text\}\}\\in\\mathcal\{U\}\_\{\\text\{text\}\},\\quad\\\|r\\\|\\leq\\epsilon\.Assume‖vy−vs‖≤B\\\|v\_\{y\}\-v\_\{s\}\\\|\\leq Bfor all valid SID tokensy,s∈𝒮y,s\\in\\mathcal\{S\}\. Let
ρ=‖P𝒰SIDP𝒰text‖2\.\\rho=\\\|P\_\{\\mathcal\{U\}\_\{\\mathrm\{SID\}\}\}P\_\{\\mathcal\{U\}\_\{\\text\{text\}\}\}\\\|\_\{2\}\.Then for any target SID tokenyyand competing SID tokenss,
\|\(zy\(h\+Δ\)−zs\(h\+Δ\)\)−\(zy\(h\)−zs\(h\)\)\|\\displaystyle\\left\|\\bigl\(z\_\{y\}\(h\+\\Delta\)\-z\_\{s\}\(h\+\\Delta\)\\bigr\)\-\\bigl\(z\_\{y\}\(h\)\-z\_\{s\}\(h\)\\bigr\)\\right\|\(7\)≤B\(ρ‖Δtext‖\+ϵ\)\.\\displaystyle\\leq B\(\\rho\\\|\\Delta\_\{\\text\{text\}\}\\\|\+\\epsilon\)\.
Consequently, if the target SID initially trails some competitor by marginγ\\gamma,
zy\(h\)−zs\(h\)≤−γ,z\_\{y\}\(h\)\-z\_\{s\}\(h\)\\leq\-\\gamma,and
γ\>B\(ρ‖Δtext‖\+ϵ\),\\gamma\>B\(\\rho\\\|\\Delta\_\{\\text\{text\}\}\\\|\+\\epsilon\),then the rationale cannot makeyyoutrankss\.
###### Proof\.
Let the SID margin between target tokenyyand competing tokenssbe
m\(h\)=zy\(h\)−zs\(h\)\.m\(h\)=z\_\{y\}\(h\)\-z\_\{s\}\(h\)\.The change in this margin after adding the rationale is
m\(h\+Δ\)−m\(h\)=\(zy\(h\+Δ\)−zs\(h\+Δ\)\)\\displaystyle m\(h\+\\Delta\)\-m\(h\)=\\bigl\(z\_\{y\}\(h\+\\Delta\)\-z\_\{s\}\(h\+\\Delta\)\\bigr\)\(8\)−\(zy\(h\)−zs\(h\)\)\.\\displaystyle\-\\bigl\(z\_\{y\}\(h\)\-z\_\{s\}\(h\)\\bigr\)\.Sincezs\(h\)=vs⊤hz\_\{s\}\(h\)=v\_\{s\}^\{\\top\}h, we have
m\(h\+Δ\)−m\(h\)=\(vy−vs\)⊤Δ\.m\(h\+\\Delta\)\-m\(h\)=\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}\\Delta\.SubstitutingΔ=Δtext\+r\\Delta=\\Delta\_\{\\text\{text\}\}\+rgives
\|\(vy−vs\)⊤Δ\|≤\|\(vy−vs\)⊤Δtext\|\+\|\(vy−vs\)⊤r\|\.\|\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}\\Delta\|\\leq\|\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}\\Delta\_\{\\text\{text\}\}\|\+\|\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}r\|\.Because only the projection ofΔtext\\Delta\_\{\\text\{text\}\}onto the SID\-discriminative subspace can affect relative SID logits,
\|\(vy−vs\)⊤Δtext\|≤‖vy−vs‖⋅‖P𝒰SIDΔtext‖\.\|\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}\\Delta\_\{\\text\{text\}\}\|\\leq\\\|v\_\{y\}\-v\_\{s\}\\\|\\cdot\\\|P\_\{\\mathcal\{U\}\_\{\\mathrm\{SID\}\}\}\\Delta\_\{\\text\{text\}\}\\\|\.SinceΔtext∈𝒰text\\Delta\_\{\\text\{text\}\}\\in\\mathcal\{U\}\_\{\\text\{text\}\},
P𝒰SIDΔtext=P𝒰SIDP𝒰textΔtext\.P\_\{\\mathcal\{U\}\_\{\\mathrm\{SID\}\}\}\\Delta\_\{\\text\{text\}\}=P\_\{\\mathcal\{U\}\_\{\\mathrm\{SID\}\}\}P\_\{\\mathcal\{U\}\_\{\\text\{text\}\}\}\\Delta\_\{\\text\{text\}\}\.By the definition ofρ\\rho,
‖P𝒰SIDP𝒰textΔtext‖≤ρ‖Δtext‖\.\\\|P\_\{\\mathcal\{U\}\_\{\\mathrm\{SID\}\}\}P\_\{\\mathcal\{U\}\_\{\\text\{text\}\}\}\\Delta\_\{\\text\{text\}\}\\\|\\leq\\rho\\\|\\Delta\_\{\\text\{text\}\}\\\|\.Therefore,
\|\(vy−vs\)⊤Δtext\|≤Bρ‖Δtext‖\.\|\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}\\Delta\_\{\\text\{text\}\}\|\\leq B\\rho\\\|\\Delta\_\{\\text\{text\}\}\\\|\.For the residual term,
\|\(vy−vs\)⊤r\|≤‖vy−vs‖‖r‖≤Bϵ\.\|\(v\_\{y\}\-v\_\{s\}\)^\{\\top\}r\|\\leq\\\|v\_\{y\}\-v\_\{s\}\\\|\\\|r\\\|\\leq B\\epsilon\.Combining the two bounds yields
\|m\(h\+Δ\)−m\(h\)\|≤B\(ρ‖Δtext‖\+ϵ\)\.\|m\(h\+\\Delta\)\-m\(h\)\|\\leq B\(\\rho\\\|\\Delta\_\{\\text\{text\}\}\\\|\+\\epsilon\)\.
It remains to show the ranking consequence\. Define
M=B\(ρ‖Δtext‖\+ϵ\)\.M=B\(\\rho\\\|\\Delta\_\{\\text\{text\}\}\\\|\+\\epsilon\)\.The bound above implies that the rationale can change the SID margin by at mostMM\. If the target SID initially trails competitorssby marginγ\\gamma, then
m\(h\)=zy\(h\)−zs\(h\)≤−γ\.m\(h\)=z\_\{y\}\(h\)\-z\_\{s\}\(h\)\\leq\-\\gamma\.After adding the rationale,
m\(h\+Δ\)\\displaystyle m\(h\+\\Delta\)=m\(h\)\+\(m\(h\+Δ\)−m\(h\)\)\\displaystyle=m\(h\)\+\\bigl\(m\(h\+\\Delta\)\-m\(h\)\\bigr\)≤m\(h\)\+M\.\\displaystyle\\leq m\(h\)\+M\.Sincem\(h\)≤−γm\(h\)\\leq\-\\gamma, we obtain
m\(h\+Δ\)≤−γ\+M\.m\(h\+\\Delta\)\\leq\-\\gamma\+M\.Ifγ\>M\\gamma\>M, then
Equivalently,
zy\(h\+Δ\)<zs\(h\+Δ\)\.z\_\{y\}\(h\+\\Delta\)<z\_\{s\}\(h\+\\Delta\)\.Thus, even after adding the rationale, the target SID tokenyystill receives a lower logit than the competing SID tokenss, soyycannot outrankss\. ∎
## Appendix EPauseRecEmbedding Visualization
Figure 6:Token embeddings after each stage of thePauseRecpipeline\. The learned<pause\>token lies near the boundary between natural\-language tokens and SID tokens, indicating that pause pretraining positions it as a bridge between the two embedding spaces\. This supports the role of<pause\>tokens in connecting semantic information from natural language to SID prediction\.Figure[6](https://arxiv.org/html/2606.14142#A5.F6)visualizes token embeddings after each stage of thePauseRecpipeline, including CPT, pause\-token pretraining, next\-item SFT, and implicit\-reasoning SFT\. Across stages, the<pause\>token stays at the boundary between the natural\-language token cluster and the SID token cluster rather than collapsing into either group\. This boundary placement provides empirical evidence that the pause token connects semantics across the two embedding spaces, helping route natural\-language knowledge toward SID generation and thereby improving recommendation performance\.
## Appendix FImplementation Details
### F\.1Sample Prompts forPauseRecTraining and Inference
We provide concrete prompt text for each stage ofPauseRecon Amazon Beauty, using the same leave\-last\-out training split as in Section[6](https://arxiv.org/html/2606.14142#S6)\. The example user has two items in the prediction history; the held\-out target item is*Raw African Black Soap from Ghana 1 Lb*\(semantic ID shown in the implicit\-reasoning blocks below\)\. Chat\-based stages use the fixed system instruction and chat turn delimiters shown in the full\-sequence examples\. During pause pretraining,<pause\>tokens are inserted at random word boundaries\. During implicit\-reasoning SFT and inference,k=5k\{=\}5pause tokens are placed between the<think\>and</think\>tags before the target SID; loss is masked on those pause positions during SFT\.
Continual pretraining \(CPT\)\.
The user has purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>, its title is "Phyto Phytocitrus Restructuring Mask for Unisex, 6\.7 Ounce", its categories are "Beauty \> Hair Care \> Conditioners"; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>, its title is "Matrix Biolage Colorcaretherapie Color Care Shampoo and Conditioner Set 33\.8oz 1 Liter", its categories are "Beauty \> Hair Care \> Shampoo & Conditioner Sets"; <\|sid\_begin\|\><s\_a\_226\><s\_b\_110\><s\_c\_129\><s\_d\_207\><\|sid\_end\|\>, its title is "Raw African Black Soap from Ghana 1 Lb", its categories are "Beauty \> Bath & Body \> Cleansers \> Soaps";
Pause\-token pretraining \(10% random <pause\> insertion; seed 42\)\.
The user has <pause\> purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>, its title is "Phyto <pause\> <pause\> Phytocitrus Restructuring <pause\> Mask for Unisex, 6\.7 Ounce", its categories are "Beauty \> <pause\> Hair Care <pause\> \> Conditioners"; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>, <pause\> its title is "Matrix Biolage Colorcaretherapie Color Care Shampoo and Conditioner Set 33\.8oz 1 Liter", its categories are "Beauty \> Hair Care \> Shampoo & Conditioner Sets"; <\|sid\_begin\|\><s\_a\_226\><s\_b\_110\><s\_c\_129\><s\_d\_207\><\|sid\_end\|\>, its title is "Raw African Black Soap from Ghana 1 Lb", its categories are "Beauty \> Bath <pause\> & Body \> Cleansers \> Soaps";
Next\-item supervised finetuning: user prompt\.
The user has purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>;
Next\-item supervised finetuning: full sequence \(empty thinking block\)\.
<\|im\_start\|\>system You are a professional recommendation expert who needs to recommend the next possible purchase for users based on their purchase history\. Please predict the most likely next product that the user will purchase based on the user’s historical purchase information\.<\|im\_end\|\> <\|im\_start\|\>user The user has purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>;<\|im\_end\|\> <\|im\_start\|\>assistant <think\></think\> <\|sid\_begin\|\><s\_a\_226\><s\_b\_110\><s\_c\_129\><s\_d\_207\><\|sid\_end\|\><\|im\_end\|\>
Implicit\-reasoning finetuning: user prompt and target SID\.
The user has purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>, its title is "Phyto Phytocitrus Restructuring Mask for Unisex, 6\.7 Ounce", its categories are "Beauty \> Hair Care \> Conditioners"; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>, its title is "Matrix Biolage Colorcaretherapie Color Care Shampoo and Conditioner Set 33\.8oz 1 Liter", its categories are "Beauty \> Hair Care \> Shampoo & Conditioner Sets"; Target SID: <\|sid\_begin\|\><s\_a\_226\><s\_b\_110\><s\_c\_129\><s\_d\_207\><\|sid\_end\|\>
Implicit\-reasoning finetuning: full sequence \(k=5k\{=\}5pause tokens\)\.
<\|im\_start\|\>system You are a professional recommendation expert who needs to recommend the next possible purchase for users based on their purchase history\. Please predict the most likely next product that the user will purchase based on the user’s historical purchase information\.<\|im\_end\|\> <\|im\_start\|\>user The user has purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>, its title is "Phyto Phytocitrus Restructuring Mask for Unisex, 6\.7 Ounce", its categories are "Beauty \> Hair Care \> Conditioners"; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>, its title is "Matrix Biolage Colorcaretherapie Color Care Shampoo and Conditioner Set 33\.8oz 1 Liter", its categories are "Beauty \> Hair Care \> Shampoo & Conditioner Sets";<\|im\_end\|\> <\|im\_start\|\>assistant <think\> <pause\><pause\><pause\><pause\><pause\> </think\> <\|sid\_begin\|\><s\_a\_226\><s\_b\_110\><s\_c\_129\><s\_d\_207\><\|sid\_end\|\><\|im\_end\|\>
Inference: prompt prefix before constrained SID decoding\.
<\|im\_start\|\>system You are a professional recommendation expert who needs to recommend the next possible purchase for users based on their purchase history\. Please predict the most likely next product that the user will purchase based on the user’s historical purchase information\.<\|im\_end\|\> <\|im\_start\|\>user The user has purchased the following items: <\|sid\_begin\|\><s\_a\_99\><s\_b\_19\><s\_c\_220\><s\_d\_204\><\|sid\_end\|\>, its title is "Phyto Phytocitrus Restructuring Mask for Unisex, 6\.7 Ounce", its categories are "Beauty \> Hair Care \> Conditioners"; <\|sid\_begin\|\><s\_a\_238\><s\_b\_74\><s\_c\_13\><s\_d\_122\><\|sid\_end\|\>, its title is "Matrix Biolage Colorcaretherapie Color Care Shampoo and Conditioner Set 33\.8oz 1 Liter", its categories are "Beauty \> Hair Care \> Shampoo & Conditioner Sets";<\|im\_end\|\> <\|im\_start\|\>assistant <think\> <pause\><pause\><pause\><pause\><pause\> </think\>
### F\.2Prompt for Qualitative Attention Analysis
Figure[5](https://arxiv.org/html/2606.14142#S6.F5)visualizes the pause\-token attention pattern for the following example\. The target item is a Dove hair styling spray, and the history contains multiple hair\-care and styling products\. In the later pause steps, attention to the history item whose SID begins with <s\_a\_206\><s\_b\_60\>,*Natures Bounty Optimal Solutions Hair, Skin and Nails Gummies*, increases\. This item is related to the target through hair\-care intent, so the increase supports the main\-paper analysis that later pauses retrieve and aggregate target\-relevant historical evidence before SID generation\.
Target Item Title: Dove Hair Styling Oxygen Moisture Root Lift Spray, 3\.3 Ounce SID: <\|sid\_begin\|\><s\_a\_140\><s\_b\_39\><s\_c\_151\><s\_d\_68\><\|sid\_end\|\> Categories: Beauty \> Hair Care \> Styling Products \> Hair Sprays
Full prompt text\.
<\|im\_start\|\>system You are a professional recommendation expert who needs to recommend the next possible purchase for users based on their purchase history\. Please predict the most likely next product that the user will purchase based on the user’s historical purchase information\.<\|im\_end\|\> <\|im\_start\|\>user The user has purchased the following items: <\|sid\_begin\|\><s\_a\_6\><s\_b\_192\><s\_c\_205\><s\_d\_33\><\|sid\_end\|\>, its title is "Axe Primed Just Clean Shampoo, 12\-Ounce Bottle \(Pack of 3\)", its categories are "Beauty \> Hair Care \> Shampoos"; <\|sid\_begin\|\><s\_a\_248\><s\_b\_8\><s\_c\_99\><s\_d\_150\><\|sid\_end\|\>, its title is "Olay Regenerist Micro\-Sculpting Serum 1\.7 Fl Oz", its categories are "Beauty \> Skin Care \> Face"; <\|sid\_begin\|\><s\_a\_113\><s\_b\_56\><s\_c\_77\><s\_d\_2\><\|sid\_end\|\>, its title is "Clearasil Ultra Acne Treatment Daily Face Wash, 6\.78 Ounce \(Pack of 3\)", its categories are "Beauty \> Skin Care \> Face \> Cleansers \> Washes"; <\|sid\_begin\|\><s\_a\_6\><s\_b\_6\><s\_c\_17\><s\_d\_210\><\|sid\_end\|\>, its title is "Pantene Pro\-V Expert Collection Agedefy Conditioner 8\.4 Fl Oz", its categories are "Beauty \> Hair Care \> Conditioners"; <\|sid\_begin\|\><s\_a\_6\><s\_b\_222\><s\_c\_222\><s\_d\_71\><\|sid\_end\|\>, its title is "Pantene Pro\-V Expert Collection Agedefy Shampoo 10\.1 Fl Oz", its categories are "Beauty \> Hair Care \> Shampoos"; <\|sid\_begin\|\><s\_a\_255\><s\_b\_71\><s\_c\_242\><s\_d\_99\><\|sid\_end\|\>, its title is "Burt’s Bees Lip Gloss, Autumn Haze, 0\.2 Fluid Ounces", its categories are "Beauty \> Makeup \> Lips \> Lip Glosses"; <\|sid\_begin\|\><s\_a\_155\><s\_b\_51\><s\_c\_96\><s\_d\_246\><\|sid\_end\|\>, its title is "Nexxus Youth Renewal Rejuvenating Shampoo, 13\.5 Ounce", its categories are "Beauty \> Hair Care \> Shampoos"; <\|sid\_begin\|\><s\_a\_248\><s\_b\_86\><s\_c\_54\><s\_d\_216\><\|sid\_end\|\>, its title is "Simple Protecting Light Moisturizer Spf 15, 4\.2 Ounce", its categories are "Beauty \> Skin Care \> Face \> Creams & Moisturizers \> Fluids & Lotions \> Lotions"; <\|sid\_begin\|\><s\_a\_113\><s\_b\_255\><s\_c\_31\><s\_d\_115\><\|sid\_end\|\>, its title is "Dove go fresh, Burst Body Wash, 24 Ounce \(Pack of 2\)", its categories are "Beauty \> Bath & Body \> Cleansers \> Body Washes"; <\|sid\_begin\|\><s\_a\_140\><s\_b\_254\><s\_c\_162\><s\_d\_130\><\|sid\_end\|\>, its title is "Nexxus Youth Renewal Plump and Lift Blow Dry Spray, 7\.5 Ounce", its categories are "Beauty \> Hair Care \> Styling Products \> Hair Sprays"; <\|sid\_begin\|\><s\_a\_140\><s\_b\_205\><s\_c\_36\><s\_d\_36\><\|sid\_end\|\>, its title is "Nexxus Youth Renewal Rejuvenating Elixir, 0\.94 Ounce", its categories are "Beauty \> Hair Care \> Hair & Scalp Treatments"; <\|sid\_begin\|\><s\_a\_155\><s\_b\_190\><s\_c\_158\><s\_d\_81\><\|sid\_end\|\>, its title is "CLEAR MEN SCALP THERAPY 2 in 1 AntiDandruff Shampoo and Conditioner, Dry Scalp Hydration, 12\.9oz", its categories are "Beauty \> Hair Care \> Shampoos"; <\|sid\_begin\|\><s\_a\_113\><s\_b\_67\><s\_c\_140\><s\_d\_115\><\|sid\_end\|\>, its title is "Schick Hydro Silk Disposable Razor, 3 Count", its categories are "Beauty \> Skin Care \> Body \> Moisturizers \> Oils"; <\|sid\_begin\|\><s\_a\_21\><s\_b\_45\><s\_c\_114\><s\_d\_49\><\|sid\_end\|\>, its title is "Own Products Refining Moisture Night Cream", its categories are "Beauty \> Skin Care \> Face \> Creams & Moisturizers \> Night Creams"; <\|sid\_begin\|\><s\_a\_113\><s\_b\_204\><s\_c\_185\><s\_d\_233\><\|sid\_end\|\>, its title is "Nivea Q10 Skin Firming Body Lotion , 13\.5 fl oz \(Pack of 2\)", its categories are "Beauty \> Skin Care \> Body \> Moisturizers \> Lotions"; <\|sid\_begin\|\><s\_a\_135\><s\_b\_169\><s\_c\_250\><s\_d\_60\><\|sid\_end\|\>, its title is "L’Oreal Paris Age Perfect Hydra\-Nutrition Moisturizer, 1\.7\-Fluid Ounce", its categories are "Beauty \> Skin Care \> Face \> Creams & Moisturizers \> Fluids & Lotions \> Fluids"; <\|sid\_begin\|\><s\_a\_238\><s\_b\_3\><s\_c\_5\><s\_d\_18\><\|sid\_end\|\>, its title is "Cristophe Professional Glossing Shampoo, 10 Ounce", its categories are "Beauty \> Hair Care \> Shampoos"; <\|sid\_begin\|\><s\_a\_140\><s\_b\_173\><s\_c\_52\><s\_d\_91\><\|sid\_end\|\>, its title is "Tresemme Keratin Smooth Smoothing Creme Serum, 3\.5 Ounce", its categories are "Beauty \> Hair Care \> Styling Products \> Creams, Gels & Lotions"; <\|sid\_begin\|\><s\_a\_206\><s\_b\_60\><s\_c\_224\><s\_d\_48\><\|sid\_end\|\>, its title is "Natures Bounty Optimal Solutions Hair, Skin and Nails Gummies, 80 Count", its categories are "Beauty \> Skin Care"; <\|sid\_begin\|\><s\_a\_140\><s\_b\_177\><s\_c\_103\><s\_d\_200\><\|sid\_end\|\>, its title is "Dove Hair Styling Oxygen Moisture Leave In Foam, 5\.1 Ounce", its categories are "Beauty \> Hair Care \> Styling Products \> Mousses & Foams";<\|im\_end\|\> <\|im\_start\|\>assistant <think\> <pause\><pause\><pause\> </think\>
### F\.3SID Metadata Decoding Prompt
For the metadata recovery experiment in Table[1](https://arxiv.org/html/2606.14142#S3.T1), we use the first two items in each dataset’s pretraining file as in\-context examples and then query the remaining items\. No judge model is used; predictions are evaluated by exact string match after parsing the generated Title: and Category: fields\. The prompt template is shown below, with the final user turn line\-wrapped for readability:
<\|im\_start\|\>system Please generate the title and category of the product based on its semantic ID\.<\|im\_end\|\> <\|im\_start\|\>user \{shot\_1\_sid\}<\|im\_end\| <\|im\_start\|\>assistant Title: "\{shot\_1\_title\}", Category: "\{shot\_1\_categories\}" <\|im\_end\|\> <\|im\_start\|\>user \{shot\_2\_sid\}<\|im\_end\| <\|im\_start\|\>assistant Title: "\{shot\_2\_title\}", Category: "\{shot\_2\_categories\}" <\|im\_end\|\> <\|im\_start\|\>user Generate the Title and Category of this product \{query\_sid\}\. Please only generate with NO thinking\! <\|im\_end\|\> <\|im\_start\|\>assistant
Generation uses greedy decoding \(do\_sample=False\) with max\_new\_tokens=256\.
### F\.4Sample Rationales Used in CoT SFT
Below we provide the reasoning field from the first sample of each CoT SFT variant\.
Template\-Category\.
The user is likely to buy items in the Beauty\>\>Tools & Accessories\>\>Mirrors\>\>Makeup Mirrors category\.
Template\-Extended\.
The user demonstrates interest in hair care and skin care products\. By identifying the user’s preference in natural, high\-quality beauty and personal care items, we can predict that, to complement their previous purchases and enhance their daily grooming routine, the user will purchase beauty tools, for example, a Home Travel 9X/1X Folding Lighted Cosmetic Mirror\.
Gemini 3\.1 Flash\-Lite Free\-form\.
The user consistently purchases hair care products and beauty tools, suggesting a strong interest in hair maintenance and styling\. Given the previous acquisition of hair care treatments and a cosmetic mirror, the model can infer a logical progression toward purchasing functional hair styling accessories like clips to complement their existing routine\.
Gemini 3\.1 Pro Free\-form\.
The user’s purchase history demonstrates a strong focus on hair care and beauty maintenance, moving from washing and conditioning treatments to grooming tools like a cosmetic mirror\. An LLM would logically infer that after acquiring products to clean and treat their hair, the user’s next step in their routine would be purchasing styling accessories like hair clips to manage and style it\.
Gemini 3\.1 Flash\-Lite Rejection\.
The user consistently purchases hair care products and beauty tools, indicating a strong focus on hair maintenance and grooming\. By identifying the pattern of hair\-related purchases, the model can infer a logical progression toward styling accessories like hair clips to complement the existing hair care routine\.
Gemini 3\.1 FL Gemini Rejection\.
The user consistently invests in high\-quality, natural, and specialized personal grooming products, suggesting a transition from purchasing consumable maintenance items to acquiring tools that facilitate their beauty and self\-care routine\. A lighted cosmetic mirror serves as a logical functional upgrade to complement their established regimen of premium hair and skin care products\.
Gemini 3\.1 Flash\-Lite Restricted\.
Step 1: The user’s purchase history shows a consistent focus on beauty and personal care products\. Step 2: This suggests the user is expanding from consumables to grooming tools\. Step 3: A lighted cosmetic mirror is the most logical next purchase because it supports the user’s existing routine\.
### F\.5Teacher Model Prompts for Reasoning Generation in CoT SFT
#### F\.5\.1Free\-form Reasoning Prompt
Given a user’s purchase history, explain step\-by\-step how an LLM might reason to predict the next item they would purchase\.User’s purchase history: \{description\}The ground truth next item is: \- Title: "\{groundtruth\_title\}" \- Categories: "\{groundtruth\_categories\}"Provide concise reasoning \(1\-2 sentences\) explaining how an LLM could logically infer this next purchase from the user’s prior items\. Focus on patterns in categories, titles, or user preferences that would lead to this prediction\. Do not include any preamble or labels—output only the reasoning text\.
#### F\.5\.2Format\-Restricted Reasoning Prompt
P1: System Role & Task Definition You are an expert at analyzing e\-commerce purchase patterns and predicting user preferences\. You are helping create reasoning traces for recommendation training data\. You will receive a user’s purchase history, a list of candidate items, and one known positive target item\. Your goal is to produce a predictive rationale: first infer the user’s most likely next need from the purchase history, then verify whether the known target item is the best available match\.P2: Collaborative Context Presentation === USER PURCHASE HISTORY === \{history\_block\}=== CANDIDATE ITEMS === \{candidate\_block\}=== KNOWN TARGET ITEM === Candidate \{target\_candidate\_id\}: Title: \{target\_record\["title"\]\}; Categories: \{target\_record\["categories"\]\}; SID: \{target\_record\["sid"\]\}P3: Reasoning Procedure Work in two phases\. Phase A: Before relying on the known target item, identify the purchase\-history items with the strongest evidence\. Cite the smallest number of SIDs needed to support the inferred pattern, usually 1\-4\. Based on those items, infer the user’s likely next need, replenishment signal, complement need, routine continuation, tool\-versus\-consumable transition, or category progression\. Phase B: Then evaluate whether the known target item matches that inferred need better than the strongest one or two alternatives in the candidate list\. Prefer concrete signals such as repetition, recency, category progression, complementarity, brand continuity, and tool\-versus\-consumable transitions when they are supported by the evidence\.P4: Critical Guidelines as Output Constraint CRITICAL GUIDELINES: 1\. When referring to purchase\-history items, cite them directly using their SID\. 2\. In Step 1 and Step 2, do not mention the target title, target SID, or candidate numbers\. 3\. Use only evidence visible in the provided titles and categories\. 4\. Do not hallucinate hidden preferences or unsupported product attributes\. 5\. In Step 3, compare against at most two real alternative candidates; do not discuss the full candidate list\. 6\. If the target is not strongly supported, say "the fit is weaker than ideal but still the best available match" instead of inventing certainty\. 7\. Keep the rationale concise, specific, and information\-dense\. Avoid generic filler\. 8\. Do not output markdown fences\.P5: Structured Multi\-Step Reasoning Format Return strict JSON with a single key "reasoning"\. The value must be one string in this format: Step 1: History\-only evidence summary citing the most relevant SIDs and the dominant recurring pattern\. Step 2: Based only on the history, infer the user’s most likely next need or next category, and state confidence as high, medium, or low\. Step 3: Now evaluate why the target item "\{target\_record\["title"\]\}" is the best available match to that inferred need relative to the strongest one or two alternatives\. Summary: One short concluding sentence, under 20 words\.
### F\.6General Language Benchmark Prompts and Generated Reasoning
For the diagnostic experiment in Table[3](https://arxiv.org/html/2606.14142#S4.T3), we evaluated the target model on MMLU, HellaSwag, PIQA, and ARC\-Challenge using a general\-language benchmark run\.
Text\-generation prompt\.For the target model’s text generation, we used a chat\-style prompt that explicitly opens a thinking block:
<\|im\_start\|\>user Answer the following multiple\-choice question\. You may explain briefly if useful, but finish with ’Final answer: <letter\>’\. \{multiple\-choice block\} <\|im\_end\|\> <\|im\_start\|\>assistant <think\>
The multiple\-choice block lists the question followed by lettered options and ends with Final answer:\. For MMLU only, the block contains five in\-context examples sampled from the MMLU development split before the test question; the other benchmarks are zero\-shot\. HellaSwag questions are prefixed with Complete the sentence:\. PIQA uses the two candidate solutions as options A/B\. ARC\-Challenge uses the answer choices from the dataset after normalizing them to A/B/C/D labels\.
Logit prompt\.For logit\-based accuracy, we did not ask the model to generate a rationale\. Instead, we used the following prompt and compared the next\-token logits assigned to the option letters:
Answer the following multiple\-choice question with the option letter only\. \{multiple\-choice block ending with Answer:\}
MMLU again uses five in\-context examples, each ending with Answer: \{gold letter\}, followed by the test question ending with Answer:\. HellaSwag, PIQA, and ARC\-Challenge use the same zero\-shot question blocks as above but end with Answer:\.
What the model actually generated\.Text generations used greedy decoding \(do\_sample=False\) with max\_new\_tokens=128\. The extracted reasoning\_text field was non\-empty for all 27,094 target examples\. However, the generated text was usually not a multi\-step rationale\. It was most often a short answer\-likelihood statement inside the<think\>block, followed after</think\>by SID tokens\. For example, the first ARC\-Challenge row has raw generation:
The user is likely to answer C </think\> <\|sid\_begin\|\><s\_a\_91\><s\_b\_84\> <s\_c\_156\><s\_d\_20\><\|sid\_end\|\><\|im\_end\|\>
Thus, the content inside<think\>\.\.\.</think\>for that example is exactly:
The user is likely to answer C\.
Representative extracted reasoning\_text entries from the artifact are:
- •MMLU: The user is likely to answer D
- •HellaSwag: The user is likely to answer D
- •PIQA: The user is likely to answer B
- •ARC\-Challenge: The user is likely to answer C
Across all target examples, the most frequent extracted reasoning strings were The user is likely to answer C \(11,891 examples\), The user is likely to answer B \(5,938\), The user is likely to answer D \(5,118\), and The user is likely to answer A \(1,732\)\. There were 215 unique extracted reasoning strings in total\. In 27,074 of 27,094 examples, the text after</think\>began with SID tokens rather than a natural\-language final answer\. These artifacts show that the model learned to fill the thinking block with a shallow answer\-prediction phrase, while the explicitly verbalized answer remained unreliable\.
### F\.7Dataset Statistics
Table 8:Dataset statistics after preprocessing\. The three Amazon benchmarks span 11\.9K–18\.4K items and 167K–296K interactions, providing recommendation tasks of different scales\.
## Appendix GComplementary Experiment Results
### G\.1Inference Speed of CoT SFT Variants
Table 9:Inference latency for CoT SFT variants andPauseRecon 500 Beauty samples\.PauseRecis fastest because it uses fixed<pause\>tokens instead of generating natural\-language rationales\.Table[9](https://arxiv.org/html/2606.14142#A7.T9)shows that all CoT SFT variants incur substantially higher inference latency thanPauseRec\. Even the shortest template rationale is about 3\.5×\\timesslower thanPauseRec, while the longer template and teacher\-generated variants are roughly 5\.5–7\.1×\\timesslower\. This gap comes from the need to autoregressively generate rationale tokens before decoding the SID\. In contrast,PauseRecinserts a fixed number of<pause\>tokens and immediately proceeds to constrained SID decoding, so its latency is largely independent of rationale length\.
We ran the benchmark on a single NVIDIA A100\-SXM4\-80GB GPU\. We did not use vLLM\. The benchmark used Hugging Face AutoModelForCausalLM\.generate with PyTorch, fp16 on CUDA, batch size 16, greedy decoding \(do\_sample=False\), max reasoning tokens 128, and max SID tokens 20\.
### G\.2Parameter Analysis
Table 10:Extended results for different numbers of<pause\>tokens\. No single value dominates every dataset and metric, butk=5k=5gives the best overall tradeoff and is used in the main experiments; boldface and underlining mark the best and second\-best results\.In this subsection, we provide supplementary experiment results on parameter analysis for our proposedPauseRecpipeline\. Specifically, as shown in Table[10](https://arxiv.org/html/2606.14142#A7.T10), we provide the full pause\-count sweep behind Fig\.[4](https://arxiv.org/html/2606.14142#S6.F4)\. We observe that, the number of<pause\>tokens yielding best performance is not identical for every dataset and metric, butk=5k\{=\}5is the most robust setting: number of<pause\>tokens being 5 is best or tied for best on 9 of 12 metrics and remains close to the best result on the remaining metrics\. Using only one or three pauses is already competitive, suggesting that a small latent computation window is useful\. Increasing to ten pauses does not provide consistent additional gains and sometimes slightly hurts performance, indicating that the benefit of pause\-based reasoning saturates after a moderate number of latent steps\.Similar Articles
Enhanced and Efficient Reasoning in Large Learning Models
This paper proposes a method for improving reasoning in large language models by recoding data to explicitly represent relationships, enabling efficient principled reasoning with polynomial-time learnability for relational rules, which addresses hallucinations and supports sound reasoning across multiple calls.
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness
This paper introduces the VPG-EA framework, which uses variational inference and posterior guidance to improve the reasoning efficiency of large language models by addressing the 'overthinking' phenomenon in chain-of-thought generation.
HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
This paper proposes HyperGuide, a method that distills reasoning progress into a hyperbolic geometric signal to guide step-by-step generation in LLMs, improving multi-step reasoning efficiency without explicit tree search.
Learning to Refine Hidden States for Reliable LLM Reasoning
Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
Introduces Latent Reward Steering (Lrs), an adaptive inference-time framework that uses sparse autoencoder latent states and a learned reward model to implicitly promote cognitive behaviors like verification and backtracking in reasoning LLMs, improving performance across multiple models and benchmarks.