QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

arXiv cs.CL Papers

Summary

QueryAgent-R1 is an agentic framework that bridges query generation and product retrieval in e-commerce using reinforcement learning and memory abstraction, improving query CTR by 2.9% and CVR by 3.1% in online tests.

arXiv:2606.05671v1 Announce Type: new

# QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation
Source: [https://arxiv.org/html/2606.05671](https://arxiv.org/html/2606.05671)
Dike Sun¹, Zheng Zou¹, Jingtong Zang¹, Qi Sun¹ (corresponding author), Huaipeng Zhao¹, Tao Luo¹, Xiaoyi Zeng¹

¹Alibaba International Digital Commercial Group

{sundike.sdk, qiran.sq}@alibaba-inc.com

###### Abstract

Query recommendation in e-commerce search aims to proactively suggest queries that match users’ potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users’ downstream preferences. This mismatch often leads to high query click-through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. Our QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large-scale production platform, QueryAgent-R1 improves Query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.

## 1 Introduction

Query recommendation Baeza-Yates et al. ([2004](https://arxiv.org/html/2606.05671#bib.bib3)); Lai et al. ([2023](https://arxiv.org/html/2606.05671#bib.bib16)); Bacciu et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib2)); Min et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib21)); Guo et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib7)) plays a crucial role in e-commerce search systems by proactively suggesting queries that match users’ potential interests and guiding them toward relevant products. In particular, in the search-bar placeholder setting Gu et al. ([2021](https://arxiv.org/html/2606.05671#bib.bib6)); Xu et al. ([2026](https://arxiv.org/html/2606.05671#bib.bib29)), where only a single pre-filled query is shown before any user input (Figure [1](https://arxiv.org/html/2606.05671#S1.F1)), recommendation quality directly affects both user engagement and downstream conversion.

![Refer to caption](https://arxiv.org/html/2606.05671v1/x1.png)

Figure 1: Illustration of the query recommendation process in QueryAgent-R1. Without a current query context, the system generates the personalized query from the user’s historical behavior sequence.

Existing industrial methods rely heavily on inventory-based matching (e.g., ItemCF Linden et al. ([2003](https://arxiv.org/html/2606.05671#bib.bib19)), Swing Yang et al. ([2020](https://arxiv.org/html/2606.05671#bib.bib31))) or independent semantic retrieval (e.g., DSSM Huang et al. ([2013](https://arxiv.org/html/2606.05671#bib.bib10)), LLM-based models Zhang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib33))). While efficient, these methods primarily optimize query-level relevance, i.e., whether a recommended query appears attractive or semantically aligned with user interests. However, they often overlook a more important question: whether the products retrieved by that query are actually consistent with the user’s downstream preferences and likely to induce further engagement. This mismatch creates a *generation-retrieval gap*: a recommended query may attract clicks and achieve a high click-through rate (CTR), while the retrieved products fail to satisfy purchase intent, resulting in poor post-click engagement and low product conversion.

To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework for query recommendation in e-commerce search. Specifically, we ground query generation in real inventory retrieval, allowing our agent to validate the generated query against the retrieved products and refine it. In addition, we propose a consistency reward for the agentic RL process to jointly optimize query relevance and downstream engagement. We also design a memory abstraction mechanism that extracts an interest graph from users’ long-term memory for efficient user profiling.

In addition, we construct two datasets based on proprietary industrial and public data for offline evaluation, and further deploy QueryAgent\-R1 on a large\-scale e\-commerce platform with tens of millions of daily active users\. Experimental results show that QueryAgent\-R1 consistently outperforms strong baselines on the offline datasets and delivers significant gains in online A/B tests, improving Query CTR by 2\.9% and guided CVR by 3\.1%\. Our main contributions are summarized as follows:

- We propose QueryAgent-R1, a memory-augmented agentic framework for query recommendation combining retrieval-grounded generation with a consistency reward for RL alignment.
- We propose a memory abstraction mechanism that extracts an interest graph from users’ long-term memory for efficient user profiling.
- We construct two datasets from industrial and public data for offline evaluation. Experimental results show that our QueryAgent-R1 achieves significant improvements in both offline and online settings.

![Refer to caption](https://arxiv.org/html/2606.05671v1/x2.png)

Figure 2: QueryAgent-R1 architecture. The policy model is optimized via RL to interleave a Memory Tool (compressing long logs) and a Search Tool (grounding generation in real inventory), unified by a Chain-of-Retrieval outcome reward.

## 2 Problem Formulation

Let $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items. For each user $u \in \mathcal{U}$, we consider a real-time interaction sequence $\mathcal{H}_u = \{(a_i, c_i)\}_{i=1}^{N}$ of the $N = 50$ most recent interactions, where $a_i \in \{\texttt{search}, \texttt{click}\}$ is the action type and $c_i$ is the textual context.

We formulate search-bar placeholder generation as a Chain-of-Retrieval optimization problem. Our objective is to find a query $q_y$ that maximizes the joint probability of query click and subsequent item click:

$$
\max_{q_y \in \mathcal{V}} \; \underbrace{P(q_y = q^{\star} \mid \mathcal{H}_u)}_{\text{Hop-1: Query CTR}} \cdot \underbrace{P(\mathcal{I}(q_y) \cap \mathcal{I}^{\star} \neq \emptyset \mid \mathcal{H}_u)}_{\text{Hop-2: Item CTR}}, \tag{1}
$$

where $\mathcal{V}$ denotes the query vocabulary and $\mathcal{I}(q_y)$ denotes the items retrieved by query $q_y$. Here, $q^{\star}$ and $\mathcal{I}^{\star}$ denote the ground-truth query and retrieved item set, respectively. In practice, we use item clicks as a dense proxy for sparse conversion signals.
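To make the two-hop objective in Eq. (1) concrete, the following minimal Python sketch picks the candidate query with the largest product of the two probabilities. The candidate strings and probability values are purely illustrative assumptions, and `select_query` is a hypothetical helper, not part of the paper's system.

```python
from typing import Dict, Tuple


def select_query(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Return the candidate maximizing P(Hop-1) * P(Hop-2), as in Eq. (1)."""
    return max(candidates, key=lambda q: candidates[q][0] * candidates[q][1])


# Hypothetical estimates: (P(query click), P(retrieved items hit a clicked item)).
candidates = {
    "cheap phone case": (0.30, 0.05),      # attractive query, weak products
    "iphone 15 clear case": (0.20, 0.25),  # fewer query clicks, better products
}

best = select_query(candidates)
print(best)
```

With these numbers the first candidate would win on query CTR alone, but the second has the larger joint score (0.05 versus 0.015), which is the behavior the chain-of-retrieval formulation is meant to encourage.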

## 3 Methodology

As shown in Figure [2](https://arxiv.org/html/2606.05671#S1.F2), our QueryAgent-R1, built on Qwen3-4B, operates in memory and product sandboxes and is trained with reinforcement learning guided by a collaborative reward that jointly optimizes query and product preferences.

### 3.1 Memory Abstraction Mechanism

We design a memory environment with dedicated tools for memory storage, compression, and retrieval based on users’ ultra-long behavioral histories, which encompass all user behaviors over a three-month period. To avoid injecting extensive online behavior logs ($\gg 10^4$ tokens) into the context, we use Qwen3-Next-80B-A3B Yang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib30)) to asynchronously compress raw user behaviors offline into a compact user profile (prompt in Appendix [6](https://arxiv.org/html/2606.05671#A5.F6)). When our agent executes a `Fetch_Memory` action, it directly retrieves the cached profile (Figure [1](https://arxiv.org/html/2606.05671#S1.F1)), which includes the latest intent, interest graph, potential identities, and other personalized signals. By incrementally refreshing the profile, we achieve 10$\times$ sequence compression while preserving up-to-date personalization.

### 3.2 Product Retrieval Augmentation

We build a product retrieval sandbox with four million real-world products, filtered from a live e-commerce environment, to ground the policy in real products and reduce the gap between generated queries and retrieved products. This allows the agent to verify whether a generated query can retrieve relevant and purchasable products before the final recommendation. The retrieval tool employs a hybrid strategy integrating BM25 Robertson and Zaragoza ([2009](https://arxiv.org/html/2606.05671#bib.bib24)) (sparse retrieval) and Qwen3-Embedding-0.6B Zhang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib33)) (dense retrieval), followed by cross-encoder reranking via Qwen3-Rerank-0.6B Zhang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib33)). During query generation, our agent calls this tool to inspect the products returned by the query and refine the query accordingly. This helps the agent generate queries that are not only attractive, but also effective at retrieving relevant and purchasable products.
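The sparse-plus-dense-then-rerank flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the sparse and dense results are merged, so reciprocal rank fusion is used here as one common choice, and the `dense_score` and `rerank_score` callables are hypothetical stand-ins for the embedding model and the cross-encoder (a simple token-overlap function plays both roles in the example).

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Plain BM25 over whitespace tokens (stand-in for the sparse retriever)."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rrf_fuse(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Reciprocal rank fusion (an assumed merge rule, not taken from the paper)."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


def hybrid_search(query: str, docs: List[str], dense_score: Scorer,
                  rerank_score: Scorer, top_k: int = 2) -> List[int]:
    """Sparse + dense candidate generation, fusion, then reranking."""
    sparse = bm25_scores(query, docs)
    ids = list(range(len(docs)))
    sparse_rank = sorted(ids, key=lambda i: sparse[i], reverse=True)
    dense_rank = sorted(ids, key=lambda i: dense_score(query, docs[i]), reverse=True)
    shortlist = rrf_fuse([sparse_rank, dense_rank])[:top_k]
    return sorted(shortlist, key=lambda i: rerank_score(query, docs[i]), reverse=True)


def token_overlap(query: str, doc: str) -> float:
    """Toy Jaccard similarity standing in for the neural scorers."""
    a, b = set(query.lower().split()), set(doc.lower().split())
    return len(a & b) / len(a | b)


docs = ["clear iphone 15 case", "garden hose 50 ft", "iphone 15 screen protector"]
result = hybrid_search("iphone 15 case", docs, token_overlap, token_overlap)
print([docs[i] for i in result])
```

In a real deployment the two scorer callables would wrap the embedding and reranking models, and the agent would read the returned product titles before deciding whether to refine its query.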

### 3.3 Agentic RL Training

We further optimize the policy model with reinforcement learning. Given a complete agent trajectory $y$, we define the reward as

$$
R(y) = \lambda_{\mathrm{fmt}} r_{\mathrm{fmt}}(y) + \lambda_{\mathrm{tool}} r_{\mathrm{tool}}(y) + \lambda_{\mathrm{hit}} r_{\mathrm{hit}}(y). \tag{2}
$$

Here, $r_{\mathrm{fmt}}(y)$ evaluates whether the trajectory follows the required structured format, $r_{\mathrm{tool}}(y)$ rewards correct invocation of the memory and product retrieval tools, and $r_{\mathrm{hit}}(y)$ measures whether the generated query leads to the desired retrieved product. We further define

$$
r_{\mathrm{hit}}(y) = \lambda_q r_q(y) + \lambda_i r_i(y). \tag{3}
$$

To mitigate reward sparsity, we decompose $r_q(y)$ and $r_i(y)$ into hard and soft components. The query reward is defined as $r_q(y) = \mathbb{I}(q_y = q^{\star}) + \mathrm{ROUGE}(q_y, q^{\star})$, where query-level ROUGE Lin ([2003](https://arxiv.org/html/2606.05671#bib.bib18)) provides dense supervision to bootstrap the sparse Exact Match Yu et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib32)). Similarly, the item reward $r_i(y) = \mathbb{I}\bigl(\mathcal{I}(q_y) \cap \mathcal{I}^{\star} \neq \emptyset\bigr) + \mathrm{ROUGE}\bigl(\mathrm{Title}(I), \mathrm{Title}(I^{\star})\bigr)$ combines the sparse HitRate Cremonesi et al. ([2010](https://arxiv.org/html/2606.05671#bib.bib4)) with title-level ROUGE to facilitate policy exploration. Here, $\mathbb{I}(\cdot)$ is the indicator function, and $\mathcal{I}(q_y)$ and $\mathcal{I}^{\star}$ denote the items retrieved by $q_y$ and the ground-truth clicked items from offline logs, respectively.
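A small Python sketch may help show how the hard and soft components of Eqs. (2) and (3) combine. Several details are assumptions made only for illustration: the ROUGE variant is not specified in the text, so ROUGE-L F1 over whitespace tokens is used; a single title string per side stands in for $\mathrm{Title}(I)$ and $\mathrm{Title}(I^{\star})$; and all $\lambda$ values are placeholder defaults.

```python
from typing import Iterable, Sequence


def lcs_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(pred: str, ref: str) -> float:
    """ROUGE-L F1 on whitespace tokens (assumed variant)."""
    p, r = pred.lower().split(), ref.lower().split()
    lcs = lcs_len(p, r) if p and r else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def query_reward(q_pred: str, q_gold: str) -> float:
    """r_q = exact match indicator + ROUGE between queries."""
    return float(q_pred == q_gold) + rouge_l(q_pred, q_gold)


def item_reward(retrieved_ids: Iterable[str], gold_ids: Iterable[str],
                retrieved_title: str, gold_title: str) -> float:
    """r_i = hit indicator + ROUGE between one retrieved and one gold title."""
    hit = float(bool(set(retrieved_ids) & set(gold_ids)))
    return hit + rouge_l(retrieved_title, gold_title)


def total_reward(r_fmt: float, r_tool: float, r_q: float, r_i: float,
                 lam_fmt: float = 0.1, lam_tool: float = 0.1, lam_hit: float = 1.0,
                 lam_q: float = 0.5, lam_i: float = 0.5) -> float:
    """Eq. (2) with r_hit expanded via Eq. (3); the weights are placeholders."""
    r_hit = lam_q * r_q + lam_i * r_i
    return lam_fmt * r_fmt + lam_tool * r_tool + lam_hit * r_hit


r_q = query_reward("red running shoes", "red shoes")  # no exact match, partial ROUGE
r_i = item_reward(["p1", "p2"], ["p2"], "red shoes for men", "red shoes for men")
print(r_q, r_i, total_reward(1.0, 1.0, r_q, r_i))
```

The soft ROUGE term keeps the reward informative when the exact match or hit indicator is zero, which is the reward-sparsity issue the decomposition targets.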

We optimize the policy using GDPO Liu et al. ([2026](https://arxiv.org/html/2606.05671#bib.bib20)). For each input $x$, we roll out $K$ trajectories $G = \{y_1, \dots, y_K\}$ and update the policy based on group-relative advantages. Compared with GRPO Shao et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib25)), GDPO uses decoupled normalization over reward components, which better fits our reward design with both dense and sparse signals. Detailed RL comparisons are in Appendix [A.4](https://arxiv.org/html/2606.05671#A1.SS4), and training prompts are in Appendix [E.1](https://arxiv.org/html/2606.05671#A5.SS1).
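The difference between normalizing the summed reward and normalizing each component separately can be illustrated with a short Python sketch. It is a simplified reading of "decoupled normalization over reward components", not a faithful GDPO implementation: the published algorithm may include further steps (for example additional normalization or clipping) that are omitted here, and the reward values and weights are made up.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore(values: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize values within one rollout group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def grpo_advantages(rewards: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """GRPO-style: sum the weighted components, then normalize the scalar reward."""
    k = len(next(iter(rewards.values())))
    totals = [sum(weights[name] * rewards[name][j] for name in rewards) for j in range(k)]
    return zscore(totals)


def decoupled_advantages(rewards: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Decoupled: normalize each component within the group, then combine."""
    k = len(next(iter(rewards.values())))
    normed = {name: zscore(vals) for name, vals in rewards.items()}
    return [sum(weights[name] * normed[name][j] for name in rewards) for j in range(k)]


# Hypothetical group of K = 4 rollouts with a dense format reward and a sparse hit reward.
rewards = {"fmt": [1.0, 1.0, 1.0, 0.0], "hit": [0.0, 0.0, 2.0, 0.0]}
weights = {"fmt": 1.0, "hit": 1.0}

print(grpo_advantages(rewards, weights))
print(decoupled_advantages(rewards, weights))
```

In the decoupled variant each component contributes on its own scale within the group, so a sparse signal such as the hit reward is not drowned out by a dense one with larger variance.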

## 4 Experiments

### 4.1 Experimental Setup

#### Datasets\.

We evaluate on two large-scale datasets: 1) Industrial: 54k active users ($\geq 30$ events/week) from a major e-commerce platform for training, and 5k for testing. 2) Public: We merge Amazon ESCI Reddy et al. ([2022](https://arxiv.org/html/2606.05671#bib.bib23)) (Apache-2.0 License) and Amazon Review Data Hou et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib9)) (MIT License) via product IDs (16k train / 1k test; details in Appendix [A.1](https://arxiv.org/html/2606.05671#A1.SS1)). For both, the $N = 50$ most recent events act as context. Ground truth is hierarchical: a query is *Hop-1 positive* if searched/clicked; an item is *Hop-2 positive* if engaged post-search.

#### Implementation\.

We train Qwen3-4B as the backbone on 8$\times$ H20 GPUs using GDPO (rollout $= 8$, lr $= 10^{-5}$). Each training run takes approximately 70 hours. See Appendices [B](https://arxiv.org/html/2606.05671#A2) and [C](https://arxiv.org/html/2606.05671#A3) for details.

#### Baselines\.

We mainly compare against two categories of baselines: 1) inventory-based retrieval methods, including Swing Yang et al. ([2020](https://arxiv.org/html/2606.05671#bib.bib31)), a graph-based collaborative filtering method, and Qwen3-Emb-0.6B / 4B Zhang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib33)), which retrieve historical queries via dense embedding similarity; and 2) LLM direct inference methods, including proprietary LLMs, which directly generate recommendation queries from raw user behavioral logs.

#### Metrics\.

Following previous work Yu et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib32)); Jin et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib13)); Jiang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib12)); Song et al. ([2025a](https://arxiv.org/html/2606.05671#bib.bib27),[b](https://arxiv.org/html/2606.05671#bib.bib28)); Zhu et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib34)), we use Q_EM, which measures exact-match query generation, I_Hit@1, which measures item retrieval success, and Cons@1, which requires both to hold simultaneously.
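The three metrics can be computed per test example and averaged, as in the minimal Python sketch below. The exact definitions are assumptions for illustration: queries are compared as raw strings, and I_Hit@1 is taken to mean that the top-1 retrieved item belongs to the ground-truth clicked set. The sample data is invented.

```python
from typing import Dict, List, Set


def evaluate(pred_queries: List[str], gold_queries: List[str],
             top1_items: List[str], gold_items: List[Set[str]]) -> Dict[str, float]:
    """Average Q_EM, I_Hit@1, and Cons@1 over a test set."""
    n = len(pred_queries)
    q_em = [p == g for p, g in zip(pred_queries, gold_queries)]
    i_hit = [item in gold for item, gold in zip(top1_items, gold_items)]
    cons = [a and b for a, b in zip(q_em, i_hit)]  # both hops must succeed
    return {
        "Q_EM": sum(q_em) / n,
        "I_Hit@1": sum(i_hit) / n,
        "Cons@1": sum(cons) / n,
    }


# Hypothetical four-example test set.
metrics = evaluate(
    pred_queries=["red shoes", "usb hub", "yoga mat", "desk lamp"],
    gold_queries=["red shoes", "usb c hub", "yoga mat", "led lamp"],
    top1_items=["i1", "i2", "i3", "i4"],
    gold_items=[{"i1"}, {"i2"}, {"i9"}, {"i4"}],
)
print(metrics)
```

Because Cons@1 requires both conditions on the same example, it can never exceed the smaller of Q_EM and I_Hit@1.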

### 4.2 Main Experiments

#### Offline Results

Table 1: Offline experimental results on industrial and public datasets.

As shown in Table [1](https://arxiv.org/html/2606.05671#S4.SS2.SSS0.Px1), our QueryAgent-R1 achieves the strongest end-to-end performance on both datasets, with particularly large gains on Cons@1. On Industrial, it reaches 0.177 Q_EM, 0.261 I_Hit@1, and 0.117 Cons@1, clearly outperforming the strongest inventory-based baseline, Qwen3-Emb-4B, and the strongest direct-inference LLM baseline, Gemini-3.1-pro (details in Appendix [A.2](https://arxiv.org/html/2606.05671#A1.SS2)). On Amazon, although Gemini-3.1-pro obtains higher Q_EM, QueryAgent-R1 achieves the best I_Hit@1 (0.144) and Cons@1 (0.063), far exceeding the baselines’ Cons@1 scores. Compared with the Qwen3-4B backbone, QueryAgent-R1 improves Cons@1 from 0.025 to 0.117 on Industrial and from 0.002 to 0.063 on Amazon, confirming the effectiveness of our end-to-end RL optimization. Moreover, simply adding retrieval or RL brings limited gains, while the strong improvements of QueryAgent-R1 show that our collaborative reward better aligns query generation with downstream retrieval.

#### Online A/B Test\.

QueryAgent-R1 was deployed on 1% of live traffic for a cumulative 7-day online A/B test, serving millions of requests. Compared to the production baseline, it achieved a 2.9% relative lift in one-hop Query CTR and a 3.1% relative lift in two-hop Order CVR, ultimately driving a 4.9% increase in Gross Merchandise Volume (GMV). This strong online performance shows that our Consistency Reward effectively bridges the generation-retrieval gap and turns offline gains into real-world impact.

### 4.3 Ablation Study

Table [2](https://arxiv.org/html/2606.05671#S4.T2) shows the incremental contribution of each component on the Industrial dataset. Adding Product Retrieval improves Cons@1 from 0.021 to 0.031 and I_Hit@1 from 0.102 to 0.122, indicating that grounding generation with catalog retrieval helps reduce invalid queries. Incorporating Memory Abstract brings a significant intermediate gain, boosting Q_EM from 0.045 to 0.099 and Cons@1 from 0.031 to 0.064, showing that compressed user memory is critical for capturing latent intent beyond raw behavior logs (details in Appendix [A.3](https://arxiv.org/html/2606.05671#A1.SS3)). Finally, RAG Rerank further raises I_Hit@1 from 0.172 to 0.261 and Cons@1 from 0.064 to 0.117, confirming that accurate reranking is essential for converting relevant candidates into strong end-to-end retrieval outcomes (more details in Appendix [A.5](https://arxiv.org/html/2606.05671#A1.SS5)). Although the full model increases p99 latency to 1,973 ms, this cost is handled in deployment via asynchronous pre-computation (see Appendix [D](https://arxiv.org/html/2606.05671#A4) for details).

Table 2: Ablation Study on the Industrial dataset. Latency denotes p99 online inference time.

## 5 Related Work

Existing work on query recommendation mainly focuses on explicit user inputs Mo et al. ([2023](https://arxiv.org/html/2606.05671#bib.bib22)); Agrawal et al. ([2023](https://arxiv.org/html/2606.05671#bib.bib1)); Dhole et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib5)); Jang et al. ([2024](https://arxiv.org/html/2606.05671#bib.bib11)); Jin et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib13)); Jiang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib12)), while our work focuses on e-commerce query recommendation without user input. Under this setting, previous industrial methods are largely based on co-occurrence matching and semantic retrieval Linden et al. ([2003](https://arxiv.org/html/2606.05671#bib.bib19)); Yang et al. ([2020](https://arxiv.org/html/2606.05671#bib.bib31)); Huang et al. ([2013](https://arxiv.org/html/2606.05671#bib.bib10)); Zhang et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib33)). Recent studies have also explored generative query recommendation from implicit logs Xu et al. ([2026](https://arxiv.org/html/2606.05671#bib.bib29)); Hao et al. ([2025](https://arxiv.org/html/2606.05671#bib.bib8)). Although effective at the query level, these methods may underemphasize downstream consistency.

## 6 Conclusion

We proposed QueryAgent-R1, a memory-augmented agentic framework for query recommendation in e-commerce search. By grounding query generation in real inventory retrieval and optimizing with a consistency reward, our method bridges the generation-retrieval gap and better aligns recommended queries with downstream business value. Experiments on both offline datasets and online A/B tests show that our QueryAgent-R1 consistently outperforms strong baselines. In future work, we plan to further improve the efficiency of online deployment while maintaining retrieval quality and downstream gains.

## Limitations

QueryAgent\-R1 is currently limited by its online inference efficiency\. Due to the relatively long response time of the agentic framework, our production system adopts an asynchronous deployment strategy to satisfy latency requirements\. Although this design enables practical large\-scale serving, it also increases system complexity and may constrain real\-time responsiveness\. Future work will focus on reducing online latency and improving inference efficiency to support more flexible deployment\.

## Ethical Considerations

When deploying our QueryAgent-R1, we follow two main ethical principles. (1) Data privacy protection: All data used for training and inference are strictly anonymized and de-identified, with no personally identifiable information, platform-specific user IDs, or other sensitive attributes included. The model operates only on sanitized interaction logs, which helps preserve user privacy throughout the entire pipeline. (2) Content safety: To reduce the risk of generating inappropriate, sensitive, or harmful suggestions, we explicitly filter sensitive topics during both training and inference. This preventive mechanism helps mitigate potential harms related to unsafe or biased content.

## Acknowledgements

The authors used generative AI tools only to polish the writing of this manuscript, such as correcting grammar, rephrasing sentences and improving readability\. The tools were not used for research idea, methodology design, experimental implementation, data analysis or interpretation of results\.

## References

- Agrawal et al. (2023) Sanjay Agrawal, Srujana Merugu, and Vivek Sembium. 2023. [Enhancing e-commerce product search through reinforcement learning-powered query reformulation](https://doi.org/10.1145/3583780.3615474). In *Proceedings of the 32nd ACM International Conference on Information and Knowledge Management*, CIKM ’23, page 4488–4494, New York, NY, USA. Association for Computing Machinery.
- Bacciu et al. (2024) Andrea Bacciu, Enrico Palumbo, Andreas Damianou, Nicola Tonellotto, and Fabrizio Silvestri. 2024. [Generating query recommendations via llms](https://arxiv.org/pdf/2405.19749). *arXiv preprint arXiv:2405.19749*.
- Baeza-Yates et al. (2004) Ricardo Baeza-Yates, Carlos Hurtado, and Marcelo Mendoza. 2004. [Query recommendation using query logs in search engines](https://doi.org/10.1007/978-3-540-30192-9_58). In *Proceedings of the 2004 International Conference on Current Trends in Database Technology*, EDBT’04, page 588–596, Berlin, Heidelberg. Springer-Verlag.
- Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. [Performance of recommender algorithms on top-n recommendation tasks](https://doi.org/10.1145/1864708.1864721). In *Proceedings of the Fourth ACM Conference on Recommender Systems*, RecSys ’10, page 39–46, New York, NY, USA. Association for Computing Machinery.
- Dhole et al. (2024) Kaustubh D. Dhole, Ramraj Chandradevan, and Eugene Agichtein. 2024. [Generative query reformulation using ensemble prompting, document fusion, and relevance feedback](https://arxiv.org/abs/2405.17658). *Preprint*, arXiv:2405.17658.
- Gu et al. (2021) Yulong Gu, Wentian Bao, Dan Ou, Xiang Li, Baoliang Cui, Biyu Ma, Haikuan Huang, Qingwen Liu, and Xiaoyi Zeng. 2021. [Self-supervised learning on users’ spontaneous behaviors for multi-scenario ranking in e-commerce](https://doi.org/10.1145/3459637.3481953). In *Proceedings of the 30th ACM International Conference on Information & Knowledge Management*, CIKM ’21, page 3828–3837, New York, NY, USA. Association for Computing Machinery.
- Guo et al. (2025) Xian Guo, Ben Chen, Siyuan Wang, Ying Yang, Chenyi Lei, Yuqing Ding, and Han Li. 2025. Onesug: The unified end-to-end generative framework for e-commerce query suggestion. *arXiv preprint arXiv:2506.06913*.
- Hao et al. (2025) Xuegang Hao, Ming Zhang, Alex Li, Xiangyu Qian, Zhi Ma, Yanlong Zang, Shijie Yang, Zhongxuan Han, Xiaolong Ma, Jinguang Liu, and 1 others. 2025. Oxygenrec: An instruction-following generative framework for e-commerce recommendation. *arXiv preprint arXiv:2512.22386*.
- Hou et al. (2024) Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. 2024. Bridging language and items for retrieval and recommendation. *arXiv preprint arXiv:2403.03952*.
- Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. [Learning deep structured semantic models for web search using clickthrough data](https://doi.org/10.1145/2505515.2505665). In *Proceedings of the 22nd ACM International Conference on Information & Knowledge Management*, CIKM ’13, page 2333–2338, New York, NY, USA. Association for Computing Machinery.
- Jang et al. (2024) Yunah Jang, Kang il Lee, Hyunkyung Bae, Hwanhee Lee, and Kyomin Jung. 2024. [Itercqr: Iterative conversational query reformulation with retrieval guidance](https://arxiv.org/abs/2311.09820). *Preprint*, arXiv:2311.09820.
- Jiang et al. (2025) Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, and Jiawei Han. 2025. [Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning](https://arxiv.org/abs/2503.00223). *Preprint*, arXiv:2503.00223.
- Jin et al. (2025) Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. [Search-r1: Training llms to reason and leverage search engines with reinforcement learning](https://arxiv.org/abs/2503.09516). *Preprint*, arXiv:2503.09516.
- Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. *IEEE Transactions on Big Data*, 7(3):535–547.
- Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In *Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles*.
- Lai et al. (2023) Eugenie Y. Lai, Zainab Zolaktaf, Mostafa Milani, Omar AlOmeir, Jianhao Cao, and Rachel Pottinger. 2023. [Workload-aware query recommendation using deep learning](https://openproceedings.org/2023/conf/edbt/paper-173.pdf). In *EDBT 2023*.
- Lewis et al. (2021) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. [Retrieval-augmented generation for knowledge-intensive nlp tasks](https://arxiv.org/abs/2005.11401). *Preprint*, arXiv:2005.11401.
- Lin (2003) Chin-Yew Lin. 2003. Rouge: Recall-oriented understudy for gisting evaluation.
- Linden et al. (2003) G. Linden, B. Smith, and J. York. 2003. [Amazon.com recommendations: item-to-item collaborative filtering](https://doi.org/10.1109/MIC.2003.1167344). *IEEE Internet Computing*, 7(1):76–80.
- Liu et al. (2026) Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. 2026. [Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization](https://arxiv.org/abs/2601.05242). *Preprint*, arXiv:2601.05242.
- Min et al. (2025) Erxue Min, Hsiu-Yuan Huang, Min Yang, Xihong Yang, Xin Jia, Yunfang Wu, Hengyi Cai, Shuaiqiang Wang, and Dawei Yin. 2025. [From prompting to alignment: A generative framework for query recommendation](https://arxiv.org/pdf/2504.10208). *arXiv preprint arXiv:2504.10208*.
- Mo et al. (2023) Fengran Mo, Kelong Mao, Yutao Zhu, Yihong Wu, Kaiyu Huang, and Jian-Yun Nie. 2023. [ConvGQR: Generative query reformulation for conversational search](https://doi.org/10.18653/v1/2023.acl-long.274). In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 4998–5012, Toronto, Canada. Association for Computational Linguistics.
- Reddy et al. (2022) Chandan K. Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, and Karthik Subbian. 2022. [Shopping queries dataset: A large-scale ESCI benchmark for improving product search](https://arxiv.org/abs/2206.06588).
- Robertson and Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. 2009. [The probabilistic relevance framework: Bm25 and beyond](https://doi.org/10.1561/1500000019). *Found. Trends Inf. Retr.*, 3(4):333–389.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. [Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300). *Preprint*, arXiv:2402.03300.
- Sheng et al. (2025) Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. [Hybridflow: A flexible and efficient rlhf framework](https://doi.org/10.1145/3689031.3696075). In *Proceedings of the Twentieth European Conference on Computer Systems*, EuroSys ’25, page 1279–1297, New York, NY, USA. Association for Computing Machinery.
- Song et al. (2025a) Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025a. [R1-searcher: Incentivizing the search capability in llms via reinforcement learning](https://arxiv.org/abs/2503.05592). *Preprint*, arXiv:2503.05592.
- Song et al. (2025b) Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025b. [R1-searcher++: Incentivizing the dynamic knowledge acquisition of llms via reinforcement learning](https://arxiv.org/abs/2505.17005). *Preprint*, arXiv:2505.17005.
- Xu et al. (2026) Jingchao Xu, Yang Li, and 1 others. 2026. [Aigq: An end-to-end hybrid generative architecture for e-commerce query recommendation](https://arxiv.org/abs/2603.19710). *Preprint*, arXiv:2603.19710.
- Yang et al. (2025) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. [Qwen3 technical report](https://arxiv.org/abs/2505.09388). *Preprint*, arXiv:2505.09388.
- Yang et al. (2020) Xiaoyong Yang, Yadong Zhu, Yi Zhang, Xiaobo Wang, and Quan Yuan. 2020. [Large scale product graph construction for recommendation in e-commerce](https://arxiv.org/abs/2010.05525). *Preprint*, arXiv:2010.05525.
- Yu et al. (2024) Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. [Rankrag: Unifying context ranking with retrieval-augmented generation in llms](https://arxiv.org/abs/2407.02485). *Preprint*, arXiv:2407.02485.
- Zhang et al. (2025) Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. [Qwen3 embedding: Advancing text embedding and reranking through foundation models](https://arxiv.org/abs/2506.05176). *Preprint*, arXiv:2506.05176.
- Zhu et al. (2025) Changtai Zhu, Siyin Wang, Ruijun Feng, Kai Song, and Xipeng Qiu. 2025. [Convsearch-r1: Enhancing query reformulation for conversational search with reasoning via reinforcement learning](https://arxiv.org/abs/2505.15776). *Preprint*, arXiv:2505.15776.

## Appendix A Experimental Details

### A\.1 Dataset Construction Details

We construct a specialized instruction tuning dataset for next query prediction by aligning search relevance annotations with user review content\. The curation pipeline is detailed below\.

#### Data Sources and Alignment

Our corpus builds upon two public datasets:

- Amazon ESCI Dataset Reddy et al\. \([2022](https://arxiv.org/html/2606.05671#bib.bib23)\): This large\-scale search relevance dataset contains query\-product pairs annotated as exact \(E\), substitute \(S\), complement \(C\), or irrelevant \(I\)\. We retain only US\-locale records and binarize labels by treating E as positive and I as negative\.
- Amazon Review Dataset Hou et al\. \([2024](https://arxiv.org/html/2606.05671#bib.bib9)\): We aggregate reviews from five categories: Magazine Subscriptions, Software, Digital Music, All Beauty, and Amazon Fashion\. Each record includes the user ID, ASIN, rating, timestamp, and review text\.

Given the absence of public click\-stream logs, we adopt a proxy alignment strategy\. An inner join on product ASIN identifies overlapping items between the ESCI and Review datasets\. Among approximately 997K unique ESCI ASINs and 1\.1M Review ASINs, we find 4,939 common ASINs, a 0\.49% overlap rate\. Merging search and review records on these shared ASINs allows us to use review text as a behavioral proxy for post\-search interaction\. After deduplicating on $(\mathit{user\_id}, \mathit{query}, \mathit{product\_id})$ tuples and keeping only positive relevance samples, we obtain 2,527,263 valid interaction records covering 896,498 unique users\.
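
The proxy alignment above can be sketched as follows\. This is a minimal illustration, not the pipeline used in the paper: the record layout \(dictionaries with keys such as asin, query, label, user\_id, timestamp, text\) and the function name align\_on\_asin are assumptions made for the example\.

```python
from collections import defaultdict


def align_on_asin(esci_rows, review_rows):
    """Inner-join ESCI and review records on ASIN, then deduplicate.

    The field names used here are hypothetical; only the join key (ASIN),
    the positive label ("E"), and the dedup tuple come from the text.
    """
    # Index the queries of positive (exact-match) ESCI rows by ASIN.
    queries_by_asin = defaultdict(list)
    for row in esci_rows:
        if row["label"] == "E":
            queries_by_asin[row["asin"]].append(row["query"])

    seen = set()
    interactions = []
    for rev in review_rows:
        # Reviews whose ASIN has no ESCI counterpart drop out (inner join).
        for query in queries_by_asin.get(rev["asin"], []):
            key = (rev["user_id"], query, rev["asin"])
            if key in seen:
                continue  # deduplicate on (user_id, query, product_id)
            seen.add(key)
            interactions.append({
                "user_id": rev["user_id"],
                "query": query,
                "product_id": rev["asin"],
                "timestamp": rev["timestamp"],
                "review_text": rev["text"],
            })
    return interactions
```

When a \(user, query, product\) tuple occurs more than once, this sketch keeps the first review it encounters; the paper does not state which duplicate is retained\.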

#### Behavior Sequence Construction

Interactions for each user are sorted chronologically and formatted into a unified behavior string:

$$b_t = \texttt{"search: } q_t \texttt{ [SEP] click: } r_t \texttt{"} \tag{4}$$

where $q_t$ is the search query and $r_t$ is the corresponding review text at time step $t$\. The full behavior sequence for user $u$ is $\mathcal{S}_u = [b_1, b_2, \dots, b_{|\mathcal{S}_u|}]$\. Users with fewer than 10 behaviors are discarded to ensure sufficient contextual history, yielding a final corpus of 17,514 active users\.

#### Next Behavior Prediction Task Formulation

The training objective follows a next behavior prediction task\. Each user sequence $\mathcal{S}_u$ is split into an input context and a prediction target:

$$\mathbf{x}_u = [b_1, b_2, \dots, b_{|\mathcal{S}_u|-1}] \tag{5}$$

$$\mathbf{y}_u = b_{|\mathcal{S}_u|} \tag{6}$$

Input behaviors in $\mathbf{x}_u$ are concatenated with newline delimiters to form a readable context, while $\mathbf{y}_u$ serves as the ground\-truth next behavior for generation\.
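
The behavior string of Eq\. \(4\) and the context/target split of Eqs\. \(5\) and \(6\) can be sketched as below\. The function names and the dictionary keys \(query, review\_text, timestamp\) are illustrative assumptions; the string template, the chronological sort, the 10\-behavior threshold, and the newline join follow the description above\.

```python
def format_behavior(query, review_text):
    """Render one behavior b_t following the template of Eq. (4)."""
    return f"search: {query} [SEP] click: {review_text}"


def build_sample(interactions, min_behaviors=10):
    """Turn one user's interactions into an (input context, target) pair.

    Returns None for users with fewer than `min_behaviors` behaviors.
    """
    # Sort chronologically before formatting.
    ordered = sorted(interactions, key=lambda r: r["timestamp"])
    behaviors = [format_behavior(r["query"], r["review_text"]) for r in ordered]
    if len(behaviors) < min_behaviors:
        return None
    context = "\n".join(behaviors[:-1])  # x_u: all behaviors except the last
    target = behaviors[-1]               # y_u: the final behavior
    return context, target
```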

#### Instruction Tuning Format

Each sample is converted into a multi\-turn conversation format compatible with large language model fine\-tuning\. The system prompt defines the model role as an e\-commerce next query prediction engine and enforces strict output constraints: predicted queries must be under 10 words, written in natural search keyword style, and wrapped in <next\_query\> XML tags without additional reasoning or markdown formatting\. The user prompt provides the chronological behavior history $\mathbf{x}_u$ and instructs the model to predict the single most probable next search query based solely on the provided context\.

#### Dataset Split and Statistics

A user\-level stratified split prevents data leakage, allocating 95% of users to training and 5% to testing\. Final dataset statistics appear in Table [3](https://arxiv.org/html/2606.05671#A1.T3)\. Both splits are stored in Parquet format with structured metadata fields including prompt messages, ability tags, reward model ground truth, and auxiliary information for downstream RLHF evaluation\.
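
A user\-level split of this kind can be sketched as follows\. The text does not say which attribute the split is stratified on, so this example performs a plain seeded random split over users; the function name split\_users and the seed are assumptions\. The property that matters for leakage is that every user lands in exactly one split\.

```python
import random


def split_users(user_ids, test_fraction=0.05, seed=0):
    """Assign each user to exactly one split so no user leaks across splits."""
    users = sorted(set(user_ids))          # deterministic starting order
    random.Random(seed).shuffle(users)     # seeded, reproducible shuffle
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train_users = set(users[n_test:])
    return train_users, test_users
```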

Table 3: Statistics of the constructed next query prediction dataset\.

### A\.2 LLM Direct Inference Details

As summarized in Table [A\.2](https://arxiv.org/html/2606.05671#A1.SS2), we observe a consistent disparity between the generative capabilities of LLMs and their retrieval consistency\. On the Amazon dataset, Gemini\-3\.1\-Pro achieves the highest Query Exact Match \(Q\_EM\) of 0\.123, indicating a strong ability to predict potential user search intents\. However, its Consistency@1 \(Cons@1\) plummets to a mere 0\.019\.

This phenomenon perfectly illustrates our core premise: accurately predicting a linguistically plausible query does not inherently guarantee the retrieval of relevant, purchasable inventory within a constrained database\. This gap is even more pronounced in larger models, suggesting that simply scaling up parameters improves generative fluency but does not necessarily resolve the grounding issue in e\-commerce retrieval tasks\.

Table 4: Full baseline results for Gemini, GPT, Qwen, and DeepSeek model families\. The top three results in each column are highlighted in first, second, and third places\.

### A\.3 Analysis of the Impact of Memory Integration

In this section, we provide a detailed empirical analysis of the full baseline results presented in Table [A\.3](https://arxiv.org/html/2606.05671#A1.SS3.SSS0.Px2), which evaluates the capability of state\-of\-the\-art Large Language Models \(LLMs\) to leverage historical context on the Industrial dataset under both with\-memory and without\-memory experimental settings\.

#### Generality and Quantitative Benefits of Memory\.

Empirical results show that incorporating a memory module yields robust performance gains across diverse model architectures\. Among the 22 evaluated models from the Gemini \([https://blog\.google/products\-and\-platforms/products/gemini/](https://blog.google/products-and-platforms/products/gemini/)\), GPT \([https://openai\.com/api/](https://openai.com/api/)\), Qwen \([https://qwen\.ai/apiplatform](https://qwen.ai/apiplatform)\), and DeepSeek \([https://huggingface\.co/deepseek\-ai](https://huggingface.co/deepseek-ai)\) families, 19 \(86\.4%\) improved on at least one key metric \($Q\_EM$, $I\_Hit@1$, or $Cons@1$\)\. Notably, 10 models \(45\.5%\), including Gemini\-2\.0\-flash, Gemini\-3\.1\-pro, GPT\-5, and most DeepSeek\-v3/v4 variants, achieved simultaneous improvements across all three dimensions\. These findings confirm that retrieved historical interactions serve as an effective plug\-and\-play enhancement, enabling models to preserve semantic consistency and make better\-informed decisions in complex multi\-turn industrial workflows\.

#### The Need for Active Adaptation and Training\.

Despite the general success, we observe that several vanilla LLMs \(e\.g\., GPT\-4o\-mini, Qwen3\.6\-plus, and DeepSeek\-r1\) exhibit minor performance fluctuations or slight degradation on specific metrics when memory is introduced\. We attribute this phenomenon to memory distraction: untrained, off\-the\-shelf models often struggle to distinguish relevant historical cues from retrieval\-induced noise, or fail to locate critical key\-value pairs within dense retrieved slots\.

This observation highlights a crucial insight: while zero\-shot memory prompt injection is beneficial, co\-designing the memory mechanism with targeted model training \(e\.g\., supervised fine\-tuning or reinforcement learning\) can substantially magnify the utilization of memory\. By explicitly training models on memory\-augmentation tasks, we can align the LLM’s internal representation with the external memory database\. This optimization enables the model to transition from passive context\-reading to active memory\-filtering and reasoning\. Such synergy explains why highly optimized or trained agents, such as our proposed QueryAgent\-R1, can overcome the distraction barrier and achieve order\-of\-magnitude higher gains compared to their untrained vanilla counterparts\.

Table 5: Full baseline results for Gemini, GPT, Qwen, and DeepSeek model families based on the Industrial dataset, evaluating the performance impact of incorporating Memory\. Green checkmarks \(✓\) and red crossmarks \($\times$\) indicate performance gains and degradation when incorporating Memory, respectively\.

### A\.4 RL Algorithm Comparison

To further validate the effectiveness of GDPO \(Group Reward\-Decoupled Normalization Policy Optimization\) in our proposed multi\-reward setting, we conduct a comparative study against the standard GRPO \(Group Relative Policy Optimization\) baseline\. The training process involves three distinct reward signals: Query Hit, Item Hit, and Grounding Consistency\.

Superiority in Multi\-Reward Coordination\. Figure [3](https://arxiv.org/html/2606.05671#A1.F3) illustrates the validation trajectories of both algorithms over 400 training steps\. We observe that while both algorithms perform similarly in the early exploration phase \(0\-100 steps\), GDPO exhibits a significantly steeper learning curve and a higher performance ceiling across all three metrics starting from step 150\.

Specifically, in Figure [3\(a\)](https://arxiv.org/html/2606.05671#A1.F3.sf1), the Consistency Score of GDPO reaches a plateau of approximately 0\.12, nearly double that of GRPO \($\approx$0\.06\)\. This disparity highlights a critical limitation of GRPO in multi\-reward scenarios: gradient interference\. GRPO often struggles to balance competing reward signals, leading to sub\-optimal policies or "seesaw" effects where one metric improves at the expense of others\. In contrast, GDPO effectively decouples the gradients of different objectives, allowing the model to optimize the Query\-Item\-Consistency triad synergistically\.
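
To make the contrast concrete, the sketch below compares two ways of turning a group of multi\-reward rollouts into advantages\. It is one reading of the "reward\-decoupled normalization" idea suggested by the full name of GDPO given in Appendix B, not the authors' implementation: the function names are hypothetical, and any later steps of the real algorithm \(such as clipping or batch\-level rescaling\) are omitted\. In the GRPO\-style variant the three rewards are summed before group normalization; in the decoupled variant each reward channel is normalized within the group first and the normalized values are then summed\.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize a list of values within one rollout group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def grpo_advantages(rewards):
    """Sum the reward channels per rollout, then normalize across the group."""
    return zscore([sum(r) for r in rewards])


def decoupled_advantages(rewards):
    """Normalize each reward channel across the group, then sum per rollout."""
    channels = [list(c) for c in zip(*rewards)]
    normed = [zscore(c) for c in channels]
    return [sum(vals) for vals in zip(*normed)]


# Four rollouts; columns are (query hit, item hit, consistency).
group = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 0)]
summed = grpo_advantages(group)
decoupled = decoupled_advantages(group)
```

In this toy group the first two rollouts have the same total reward, so the summed variant assigns them identical advantages\. The decoupled variant gives the second rollout a larger advantage because it is the only one that earns the rarer item\-hit reward, which illustrates how per\-reward normalization can keep one signal from being averaged away by another\.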

Stability and Convergence\. Furthermore, GDPO demonstrates superior training stability\. As shown in the Item Hitrate and Query Hitrate curves \(Figure [3\(b\)](https://arxiv.org/html/2606.05671#A1.F3.sf2) and [3\(c\)](https://arxiv.org/html/2606.05671#A1.F3.sf3)\), GDPO maintains a steady upward trajectory with lower variance compared to the highly oscillatory behavior of GRPO\. This suggests that our algorithm is more robust to the sparse and noisy nature of reward signals in complex agentic retrieval tasks\.

![Refer to caption](https://arxiv.org/html/2606.05671v1/reward_consistency.png) \(a\) Validation Consistency Score
![Refer to caption](https://arxiv.org/html/2606.05671v1/reward_item_em.png) \(b\) Validation Item Hitrate
![Refer to caption](https://arxiv.org/html/2606.05671v1/reward_query_em.png) \(c\) Validation Query Hitrate

Figure 3: Training dynamics comparison between GDPO and GRPO\. GDPO demonstrates superior convergence and higher asymptotic performance in all three key metrics within the multi\-reward optimization framework\.

### A\.5 Ablation Analysis on the Qwen3\-4B Backbone

To analyze the scaling behavior and individual contributions of our core algorithmic components, we conduct a cumulative ablation study on the lightweight Qwen3\-4B backbone \(Table [6](https://arxiv.org/html/2606.05671#A1.T6)\)\.

Table 6: Ablation study on the Industrial dataset utilizing the Qwen3\-4B backbone\. Underline indicates the best performance among 4B\-based variants; bold denotes the overall best performance\.

#### Analysis of Isolated Components and Rejection Sampling\.

Applying RAG Lewis et al\. \([2021](https://arxiv.org/html/2606.05671#bib.bib17)\) directly to the raw 4B backbone degrades $Q\_EM$ \(from 0\.037 to 0\.023\), showing that small models suffer from context distraction when processing raw retrieved slots\. For \+ Rejection Sampling, we fine\-tune the 4B model on high\-quality, successful trajectories generated by our strong teacher model \(QueryAgent\-R1\) and filtered via rejection sampling\. While this distillation\-style SFT yields more stable formatting \($Q\_EM = 0.047$\) and logical consistency \($Cons@1 = 0.033$\) compared to standard SFT, its overall reasoning capability is still fundamentally capped by the inherent distribution shift between offline teacher demonstrations and online student rollouts\.

#### The Synergy of SFT & RL\.

The joint \+ SFT & RL paradigm achieves the best performance among all 4B variants \(0\.114/0\.197/0\.065\)\. Here, SFT provides essential structure by enforcing formatting and task constraints, which drastically narrows the exploration space for subsequent Reinforcement Learning\. Guided by this warm\-up, online RL successfully optimizes search strategies and avoids policy collapse\.

## Appendix B Implementation Details

The main policy model \(Qwen3\-4B\) is trained on an $8\times$ H20 GPU cluster using the verl framework Sheng et al\. \([2025](https://arxiv.org/html/2606.05671#bib.bib26)\) with Group Reward\-Decoupled Normalization Policy Optimization \(GDPO\), a group size of $K=8$, and a learning rate of $10^{-5}$ for 3 epochs over a dataset of 54k training samples\. Our retrieval backend executes Hybrid Retrieval using FAISS\-based dense vectors \(Qwen3\-Emb\-0\.6B\) and BM25 Robertson and Zaragoza \([2009](https://arxiv.org/html/2606.05671#bib.bib24)\) sparse matching, combined with a Qwen3\-Rerank\-0\.6B module to score query\-item\-behavior triplets\.

Offline Training\. To support high\-concurrency tool\-calling during RL rollouts, we encapsulate the RAG Search Tool as a distributed FastAPI sandbox powered by Ray\. The retrieval backend executes Hybrid Retrieval using FAISS\-based dense vectors \(Qwen3\-Emb\-0\.6B\) and BM25 Robertson and Zaragoza \([2009](https://arxiv.org/html/2606.05671#bib.bib24)\) sparse matching, followed by a Qwen3\-Rerank\-0\.6B module to precisely score query\-item\-behavior triplets\. To prevent sandbox overload during massive policy exploration, we implement a token\-bucket rate limiter\. The main policy model \(Qwen3\-4B\) is trained on $8\times$ H20 GPUs using standard GDPO \(group size $K=8$, learning rate $10^{-5}$\) for 3 epochs over 54k samples\.

Online Serving\. Memory Abstraction profiles are pre\-computed via daily batch jobs\. The online agent runs on a single A10 GPU using vLLM Kwon et al\. \([2023](https://arxiv.org/html/2606.05671#bib.bib15)\)\. As established in Section 4\.3, the $\sim$2,000 ms p99 latency is fully absorbed by our asynchronous deployment\. Furthermore, because the generated queries are strictly grounded in inventory, we can bypass the expensive downstream ranking stages of the legacy pipeline, maintaining strict computational cost parity\.

## Appendix C Tool\-Use Infrastructure

### C\.1 Resilient Distributed Tool\-Calling Infrastructure

During parallel RL rollouts, the high frequency of environmental feedback requests presents severe concurrency challenges\. To scale the RAG Search Tool safely, we design a cluster\-wide singleton execution framework using Ray\. Rather than initializing independent rate limiters per agent thread, which leads to aggregate bottleneck failures, we register named, detached Ray actors cluster\-wide\. Specifically, a Centralized Token\-Bucket Limiter implements a thread\-safe token\-bucket algorithm using asynchronous semaphore orchestration across dedicated acquire and release concurrency groups\. Simultaneously, a Global Worker Pool acts as a cluster\-wide actor singleton that schedules non\-blocking asynchronous calls via await execution to optimize throughput\. Furthermore, we implement a defensive transaction layer within the execution workers: if downstream APIs experience transient failure under peak load, a fault\-tolerant HTTP pipeline automatically triggers up to 10 retries with dynamic backoff delay scaling \($t_{\text{delay}} = \text{Initial Delay} \times \text{Attempt}$\)\. This robust systems\-level design eliminates environment\-induced rollout crashes, securing deterministic and high\-throughput RL training trajectories\.
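
The two mechanisms described above, a token\-bucket limiter and retries whose delay grows linearly with the attempt number, can be sketched in a single process as follows\. This is a simplified illustration under stated assumptions: it omits Ray actors, concurrency groups, and thread safety, and the class name TokenBucket, the function call\_with\_retries, the default initial delay, and the injectable clock and sleep arguments are all hypothetical\.

```python
import time


class TokenBucket:
    """Single-process token bucket (not thread-safe; illustration only)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n=1):
        """Refill according to elapsed time, then take n tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def call_with_retries(fn, max_retries=10, initial_delay=0.5, sleep=time.sleep):
    """Call fn; on failure wait initial_delay * attempt and retry.

    After max_retries failed retries the last exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise
            sleep(initial_delay * attempt)  # t_delay = Initial Delay x Attempt
```

Passing the clock and sleep functions as arguments keeps the sketch testable without real waiting; a cluster\-wide deployment would instead host one such limiter inside a shared actor so that all rollout workers draw from the same bucket\.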

### C\.2 User Memory Sandbox Architecture

To serve user memory profiles during RL rollouts, we run a dedicated memory retrieval service encapsulated as a Ray actor allocated with 1 CPU and 0 GPUs\. This minimal allocation prevents resource contention with the primary policy models during parallel rollout generation\.

#### Distributed Deployment and API Hosting

Inside the Ray actor, an asynchronous FastAPI server hosted via Uvicorn manages incoming HTTP requests\. The server runs on a lightweight event loop to handle concurrent profile fetches from the rollout workers without blocking the environment simulation\. The service logs processing latencies and file access paths to monitor system performance under high query throughput\.

#### Identifier Resolution Pipeline

Because online user identifiers \(user\_id\) often contain characters incompatible with local file systems \(such as slashes\), the sandbox implements a deterministic matching sequence to locate the corresponding profile JSON files\. Upon receiving a request, the resolver first searches for a file named after the raw user\_id\. If not found, it falls back to replacing forward slashes \(/\) with hyphens \(\-\)\. If a match still fails, it applies URL percent\-encoding to the identifier\. This fallback sequence prevents runtime exceptions and request failures caused by encoding discrepancies between the database and the file system\.
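
The three\-step fallback can be sketched as below\. The function names, the \.json suffix, and the flat directory layout are assumptions made for the example; only the order of the three lookups comes from the description above\.

```python
from pathlib import Path
from urllib.parse import quote


def candidate_names(user_id):
    """File-name candidates in lookup order: raw, slash-to-hyphen, percent-encoded."""
    return [user_id, user_id.replace("/", "-"), quote(user_id, safe="")]


def resolve_profile(profile_dir, user_id):
    """Return the first existing profile path for user_id, or None."""
    root = Path(profile_dir)
    for name in candidate_names(user_id):
        if "/" in name:
            # A name containing a slash cannot be a single file name, so the
            # raw lookup cannot match here; skipping also avoids path escapes.
            continue
        path = root / f"{name}.json"
        if path.is_file():
            return path
    return None
```

Returning None instead of raising lets the caller decide how to handle a user without a stored profile, for example by serving an empty memory\.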

#### Memory Formatting and Representation

Directly injecting raw, nested JSON profiles into the LLM context increases input prompt lengths and can lead to attention drift\. To optimize token usage, the sandbox formats the parsed JSON data into a structured textual representation\. The formatter extracts specific fields including short\-term transactional intents, style tags, demographic identities, and price segments\. To represent user interests compactly, the key\-value pairs of the user’s interest graph are sorted in descending order by their numerical weights and formatted as a comma\-separated list\. These extracted attributes are then compiled into a plain\-text template before being returned to the policy agent\.
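
A minimal sketch of such a formatter is shown below\. The JSON field names follow the output schema of the memory abstraction prompt in Appendix E\.2; the function name format\_memory, the exact wording of the text template, and the assumption that interest scores parse as numbers are illustrative choices, not details given in the paper\.

```python
import json


def format_memory(profile_json):
    """Flatten a nested JSON profile into a short plain-text block."""
    profile = json.loads(profile_json)
    graph = profile.get("interest_graph", {})
    # Sort interests by numeric weight, highest first (scores assumed numeric).
    ranked = sorted(graph.items(), key=lambda kv: float(kv[1]), reverse=True)
    interests = ", ".join(f"{name} ({score})" for name, score in ranked)
    lines = [
        f"Short-term intent: {profile.get('short_term_intent', '')}",
        f"Style tags: {', '.join(profile.get('style_tags', []))}",
        f"Identities: {', '.join(profile.get('potential_identities', []))}",
        f"Price segment: {profile.get('price_segment', '')}",
        f"Interests: {interests}",
    ]
    return "\n".join(lines)
```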

### C\.3 Hybrid Search and Reranking Sandbox Architecture

To execute real\-time product retrieval and preference\-aware ranking during parallel reinforcement learning rollouts, we deploy a hybrid search sandbox\. This infrastructure coordinates dense and sparse retrieval engines, merges candidate spaces, and refines results through a generative sequence classifier\.

#### Distributed Resource Allocation and Web Hosting

The retrieval and reranking engine is encapsulated as a high\-concurrency Ray actor allocated with 16 CPUs and 8 GPUs\. This substantial compute allocation allows the sandbox to absorb high\-throughput batch queries generated by the policy rollouts without introducing pipeline stalls\. Inside the actor, an asynchronous FastAPI web service managed by Uvicorn processes incoming requests\. The sandbox hosts two primary network architectures in half\-precision \(FP16\): a dense bi\-encoder transformer for vector search and an autoregressive decoder model for personalized ranking\. Both models use automatic device mapping to optimize memory utilization across the available GPU nodes\.

#### Dense\-Sparse Hybrid Retrieval Pipeline

The retrieval stage executes complementary dense and sparse search pipelines to maximize query\-product recall\. For dense retrieval, incoming query sequences are concatenated with task instructions and passed through the bi\-encoder model\. The system extracts query embeddings by applying last\-token pooling over the final hidden states\. These representations are $L_2$\-normalized and searched against a pre\-compiled index using FAISS Johnson et al\. \([2019](https://arxiv.org/html/2606.05671#bib.bib14)\)\. For sparse retrieval, the sandbox runs a tokenized, lowercase keyword search against a pre\-computed BM25 index built over the entire product document corpus\. The candidates returned from the dense FAISS lookup and the sparse BM25 search are merged and deduplicated\. This hybrid approach ensures that the candidate pool captures both high\-level semantic contexts and precise keyword\-level matches\.
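
The glue steps of this pipeline can be sketched without the model or the indexes\. The example below is a simplified illustration in plain Python: last\-token pooling for one sequence, vector normalization, and the merge of dense and sparse candidate lists\. All function names are hypothetical, and placing dense results ahead of sparse ones in the merged list is an assumption, since the text does not specify the merge order\.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last attended token of one sequence."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def merge_candidates(dense_ids, sparse_ids):
    """Union of both candidate lists, keeping the first occurrence of each id."""
    seen, merged = set(), []
    for pid in list(dense_ids) + list(sparse_ids):
        if pid not in seen:
            seen.add(pid)
            merged.append(pid)
    return merged
```

For instance, merging the dense list p1, p2 with the sparse list p2, p3 yields a pool of three products, with p2 counted once\.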

#### Generative Reranking and Logit Calibration

Following candidate retrieval, the candidate pool is pruned and reordered based on a localized user transaction history database\. The reranker maps the incoming user identifier to their historical interaction records \(e\.g\., clicks and browse behavior\) to construct a detailed ranking prompt\. This prompt combines the task instruction, the user’s interaction logs, the search query, and the candidate product text\. Rather than decoding tokens autoregressively, the causal language model functions as a binary classifier\. The system isolates the raw output logits of the next\-token prediction at the exact vocabulary coordinates of the target tokens “yes” and “no”\. To prevent probability distortion, these two logits are then calibrated using a log\-softmax function\. The exponentiated score of the positive class token \(“yes”\) represents the final relevance score\. The candidate products are then sorted in descending order based on this score, and the top\-ranked products are returned to the policy agent\.
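
The scoring step can be sketched as follows, assuming the two logits for “yes” and “no” have already been read from the model output\. The function names and the top\_k default are hypothetical; the computation is a log\-softmax over the two logits followed by exponentiation of the “yes” entry, which is equivalent to a two\-way softmax\.

```python
import math


def yes_probability(logit_yes, logit_no):
    """exp(log_softmax([no, yes])[yes]), computed in a numerically stable way."""
    m = max(logit_yes, logit_no)
    log_z = m + math.log(math.exp(logit_yes - m) + math.exp(logit_no - m))
    return math.exp(logit_yes - log_z)


def rerank(candidates, top_k=3):
    """candidates: iterable of (product_id, logit_yes, logit_no) triples."""
    scored = [(yes_probability(y, n), pid) for pid, y, n in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)  # highest relevance first
    return [pid for _, pid in scored[:top_k]]
```

Restricting the softmax to the two target tokens means the score depends only on the difference between the “yes” and “no” logits, so probability mass assigned to unrelated vocabulary tokens does not affect the ranking\.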

## Appendix D Online Deployment Details

The post\-trained Qwen3\-4B agent is deployed online using the vLLM inference engine Kwon et al\. \([2023](https://arxiv.org/html/2606.05671#bib.bib15)\) hosted on a single NVIDIA L20 GPU\. Under this configuration, the single\-node serving throughput is approximately 10 queries per second \(QPS\)\. Due to these computational throughput constraints, we route a controlled 1% subset of live production traffic to the agent pipeline\.

### D\.1 Near\-Line Latency Masking

The end\-to\-end inference latency of the agent is approximately 2\.0 seconds \(p99\), which exceeds the strict sub\-hundred\-millisecond budget required for direct synchronous homepage rendering\. To resolve this without degrading the user experience, we implement a near\-line \(asynchronous\) serving strategy leveraging user navigation paths\. The agent’s query generation is preemptively triggered the moment a user clicks a product to enter its detail page\. Because the user’s dwell time on the detail page before returning to the homepage typically spans 5 to 10 seconds, the 2\.0\-second execution latency of the agent is fully masked behind the client\-side navigation delay\. This ensures that the personalized query recommendation is already computed and cached by the time the user returns to the homepage\.

### D\.2 Online Tool and Memory Infrastructure

User memory profiles are constructed via offline batch pipelines and indexed in an online Redis cache to enable low\-latency retrieval\. Once a user accumulates a threshold of real\-time behavioral signals, the memory updater runs an incremental writeback to update the corresponding Redis entry\. The product retrieval tool used by the agent during online inference remains identical to the hybrid dense\-sparse search service described in the training phase, ensuring tool\-calling consistency between offline optimization and online execution\.

## Appendix E Prompt

### E\.1 Training & Inference Prompt

In this section, we present the prompt used for training and inference, as shown in Figure [4](https://arxiv.org/html/2606.05671#A5.F4) and Figure [5](https://arxiv.org/html/2606.05671#A5.F5)\. It instructs the agent to predict the next search query from the user’s chronological behavior sequence, while mandating memory retrieval and optional search tool use when additional product context is needed\. The prompt also enforces concise natural\-language outputs and a strict XML\-based response format, ensuring consistent query generation and tool\-calling behavior\.

Query Agent Prompt Template \(SYSTEM\_ROLE\)

\#\#\# Role Definition
You are an E\-commerce Query Prediction & Search Agent\.
Your task is to analyze the user’s chronological real\-time behavior logs, determine if additional information is needed via search, and predict the single most likely next search query they will type\.

\#\#\# Generation Rules
1\. Context Awareness: Derive the next query strictly based on the progression of the provided behavior logs\.
2\. User Memory Retrieval \(MANDATORY\):
\- Requirement: Before predicting any query, you MUST first retrieve the user’s long\-term profile and preferences to understand their intent\.
\- Execution: Call the memory tool using the user’s ID found in the context\.
\- Syntax: To call the memory tool: <tool\_call\> user\_id </tool\_call\>\.
\- Integration: Use the retrieved profile \(price tier, brand preference, style tags\) to refine and personalize the predicted query\.
3\. Tool Usage & Refinement:
\- Execution: Verify if you have enough product info context\. If info is missing or context is complex, you MUST call Tool\(search\)\.
\- Syntax: To call a tool: <tool\_call\> query \+ user\_id </tool\_call\>\.
\- Result Validation: If search results are available, evaluate whether they cover the products user clicked subsequently\.
\- Query Adjustment: Adjust the predicted query to bridge any coverage gaps identified between search results and user behavior\.
\- Reasoning: Use <reason\>\.\.\.</reason\> to delineate this thought process and tool check\.
4\. Conciseness: The generated query must be under 10 words\.
5\. Natural Language: The query should sound like a natural user search input \(keywords \+ modifiers\), not a full sentence\.
6\. No Explanations: Do not output any extra text outside the defined XML tags\.

\#\#\# Output Format Rules
```
1. Strict XML: Output must be wrapped in <next_query> tags.
2. No Markdown: Do not use markdown code blocks (e.g., ‘‘‘xml).
3. Structure:
   [Reasoning steps and tool check]
   <next_query>
   [predicted query string]
   </next_query>
```

\# Tools
You may call one or more functions to assist with the user query\.
You are provided with function signatures within <tools\></tools\> XML tags:
<tools\>
\{"type": "function", "function": \{"name": "search", "description": "Searches for relevant products based on the given queries\.", "parameters": \{"type": "object", "properties": \{"query\_list": \{"type": "array", "description": "List of search queries to retrieve related products\."\}, "user\_id": \{"type": "string", "description": "The unique identifier of the user to fetch history behaviors for\."\}\}, "required": \["query\_list", "user\_id"\]\}\}\}
\{"type": "function", "function": \{"name": "memory", "description": "Retrieves the memory logs or profile for a specific user\.", "parameters": \{"type": "object", "properties": \{"user\_id": \{"type": "string", "description": "The unique identifier of the user to fetch memory for\."\}\}, "required": \["user\_id"\]\}\}\}
</tools\>

For each function call, return a json object with function name and arguments within <tool\_call\></tool\_call\> XML tags:
<tool\_call\>
\{"name": <function\-name\>, "arguments": <args\-json\-object\>\}
</tool\_call\>

Figure 4: System prompt template used to instruct the Query Prediction and Search Agent, enforcing multi\-tool calling protocols and target formatting constraints\.

Query Agent Prompt Template \(USER\_ROLE\)

\#\#\# Input Data
user\_id: \{user\_id\}

Below is the chronological list of the user’s real\-time search and interaction behaviors:

\{realtime\_behavior\_sequence\}

\#\#\# Instruction
Based strictly on the behavior sequence above, generate the single most probable next search query\.
Output the result immediately in the required XML format\.

Figure 5: User prompt template representing the input structure, dynamically populating real\-time behavior sequences and localized user tags\.
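
On the consuming side, a response that follows this format can be parsed with a few regular expressions\. The sketch below is an illustration rather than the parser used in the paper: the function name parse\_agent\_output is hypothetical, and treating a query of 10 or more words as invalid is one way to apply the "under 10 words" rule\.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
NEXT_QUERY_RE = re.compile(r"<next_query>(.*?)</next_query>", re.DOTALL)


def parse_agent_output(text, max_words=10):
    """Extract JSON tool calls and the final query from one agent response."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            continue  # skip tool calls that are not valid JSON
    match = NEXT_QUERY_RE.search(text)
    query = match.group(1).strip() if match else None
    # Reject empty queries and queries that are not under max_words words.
    if not query or len(query.split()) >= max_words:
        query = None
    return calls, query
```

A rollout harness could use the returned tool calls to dispatch memory or search requests and use the parsed query, when present, as the prediction to score\.
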
### E\.2 Memory Abstract Prompt

In this section, we present the prompt for memory abstraction\. As shown in Figure [6](https://arxiv.org/html/2606.05671#A5.F6), this prompt is used to compress raw user behavioral logs into a structured memory profile for future retrieval\. The prompt also enforces a strict JSON output schema, enabling consistent memory updates for downstream agent reasoning\.

Memory Abstract Prompt Template \(SYSTEM\_MESSAGES\)

You are an expert User Memory Manager for an e\-commerce AI Agent\. Your goal is to analyze raw user behavioral logs and distill them into a structured, abstract user profile for future retrieval\.

Input Format:
You will receive a list of sequential actions performed by a specific user \(user\_id\)\. The data includes item descriptions, categories, timestamps, and prices\.

Processing Logic \(Abstract & Layering\):
Do not simply memorize the list of items\. Instead, analyze the data to generate a memory update based on the following layers:

1\. Level 1: Short\-term Intent \(The "Now"\)
\* Analyze recent timestamps and item clusters\.
\* Determine the immediate goal \(e\.g\., "Shopping for Christmas decorations," "Stocking up on household essentials," "Looking for jewelry"\)\.

2\. Level 2: Abstract Preferences \(The "Taste"\)
\* Extract stylistic keywords from ‘action\_beh‘ \(e\.g\., "French Retro", "Geometric"\)\.
\* Infer price sensitivity based on ‘action\_price‘\.
\* Identify preferred brands or shops\.

3\. Level 3: Long\-term Profile \(The "Identity"\)
\* Infer high\-level categories \(e\.g\., "Home Owner," "Pet Owner" based on ’Pet best supplies’\)\.
\* Summarize key category interests \(e\.g\., ’Jewellery’, ’Window Treatment’\)\.

Output Schema \(JSON\):
Please output the analysis in the following JSON format strictly:
\{
"user\_id": "string",
"memory\_update\_timestamp": "string",
"short\_term\_intent": "Summary string",
"style\_tags": \["tag1", "tag2"\],
"price\_segment": "Low/Medium/High",
"potential\_identities": \["identity1", "identity2"\],
"interest\_graph": \{
"category\_name": "score \(1\-10\)"
\},
"summary\_text": "A concise natural language summary of this user’s recent behavior for the Agent to read\."
\}

Figure 6: System prompt template for the User Memory Manager, responsible for abstracting sequential, fine\-grained interaction logs into layered, structured JSON user profiles\.
