DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation

arXiv cs.LG Papers

Summary

This paper introduces DRIVE, a unified Transformer-based framework for offline auto-bidding that decouples candidate action generation from decision making, combining distributional action modeling, retrieval-augmented candidate generation, and value-based evaluation to improve bidding performance under budget and cost constraints.

arXiv:2606.14192v1 Announce Type: new

Cached at: 06/15/26, 09:12 AM

# DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation
Source: [https://arxiv.org/html/2606.14192](https://arxiv.org/html/2606.14192)
Haochen Wang, Shangqin Mao, Xun Yang, Qianlong Xie, Xingxing Wang, Xuri Ge, Ying Zhou, Zhiwei Xu

###### Abstract

Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement learning and, more recently, Transformer-based sequence modeling have shown promise for learning bidding policies from logged data, but their unimodal and purely parametric formulations often collapse multiple effective bidding strategies into suboptimal averaged actions and perform unreliably under sparse or long-tail traffic. To mitigate these limitations, we propose DRIVE (Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework that decouples candidate action generation from decision making for offline auto-bidding. DRIVE combines distributional action modeling, retrieval-augmented candidate generation from high-quality historical decisions, and value-based evaluation to select the most promising bid at inference time. Extensive experiments on AuctionNet and additional offline reinforcement learning benchmarks demonstrate that DRIVE consistently improves bidding performance and generalizes well across multiple Transformer-based methods.

Machine Learning, ICML

## 1 Introduction

Online advertising has become a primary channel for monetizing digital traffic, where advertisers compete for impression opportunities through real-time bidding (RTB) (Yuan et al., [2013](https://arxiv.org/html/2606.14192#bib.bib1); Wang and Yuan, [2015](https://arxiv.org/html/2606.14192#bib.bib2)). Modern advertising platforms widely adopt auto-bidding mechanisms (Balseiro et al., [2021a](https://arxiv.org/html/2606.14192#bib.bib3), [b](https://arxiv.org/html/2606.14192#bib.bib4); Deng et al., [2021](https://arxiv.org/html/2606.14192#bib.bib5); Ou et al., [2023](https://arxiv.org/html/2606.14192#bib.bib6)) to optimize long-term performance while satisfying practical constraints, such as budget limits and target Cost Per Action (CPA) (He et al., [2021](https://arxiv.org/html/2606.14192#bib.bib7); Wu et al., [2018](https://arxiv.org/html/2606.14192#bib.bib8)). However, the bidding environment is inherently dynamic and uncertain: static heuristics struggle to yield robust strategies, and online reinforcement learning (RL) exposes advertisers to the risks of live exploration.

![Refer to caption](https://arxiv.org/html/2606.14192v1/x1.png)

Figure 1: Two challenges for DT-style methods in real-world bidding. Left: The Average Action trap, where unimodal action modeling collapses multiple effective bidding modes into a suboptimal averaged action. Right: Sparse data and long-tail traffic, where current methods generate unreliable actions in low-density regions despite the presence of high-quality decisions in the dataset.

Given the prohibitive risk and cost of online exploration in real-world ad auctions, offline RL (Levine et al., [2020](https://arxiv.org/html/2606.14192#bib.bib28)), such as Conservative Q-Learning (CQL) (Kumar et al., [2020](https://arxiv.org/html/2606.14192#bib.bib21)), becomes not just appealing but necessary. It enables the learning of bidding policies solely from logged historical data without interacting with the live market. Crucially, auto-bidding is inherently a sequential decision-making process, as current expenditures directly constrain future bidding capabilities. Consequently, Transformer-based sequence modeling approaches, such as Decision Transformer (DT) (Chen et al., [2021](https://arxiv.org/html/2606.14192#bib.bib23)), have shown strong potential by leveraging long-range temporal dependencies through attention-based architectures (Vaswani et al., [2017](https://arxiv.org/html/2606.14192#bib.bib41)). Building upon this line of work, a growing body of DT-style variants has recently been proposed and applied to advertising scenarios (Li et al., [2025](https://arxiv.org/html/2606.14192#bib.bib11); Gao et al., [2025](https://arxiv.org/html/2606.14192#bib.bib12)). Nevertheless, directly applying such DT-style architectures to real-world bidding scenarios remains challenging, as illustrated in Figure [1](https://arxiv.org/html/2606.14192#S1.F1).
A prominent challenge is the “Average Action” trap, which arises from the fact that similar market states often admit multiple distinct yet effective bidding strategies, such as comparatively high or low bids. The unimodal or deterministic modeling in these approaches tends to collapse such diverse behaviors into suboptimal averaged actions that are neither sufficiently aggressive to secure auctions nor conservative enough to control costs. Beyond this issue, the purely parametric nature of current Transformer-based methods implies the absence of explicit mechanisms for retaining high-quality historical decisions, rendering them vulnerable to unreliable action generation under long-tail traffic or sparse data regimes.

To address these limitations, we propose DRIVE (Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework for offline auto-bidding that combines distributional action modeling, retrieval-augmented candidate generation, and value-based decision making. Unlike standard approaches, DRIVE decouples candidate action generation from decision making. Specifically, we first model the action distribution with a Gaussian Mixture Model (GMM) (Reynolds, [2018](https://arxiv.org/html/2606.14192#bib.bib42)), enabling the policy to capture diverse yet effective bidding patterns. In addition, a retrieval mechanism encodes the current state and retrieves high-quality historical actions from similar states as supplementary candidates, providing explicit non-parametric support and mitigating unreliable actions under sparse data regimes. A value critic is further incorporated to evaluate both generated and retrieved candidates during inference and select the most promising bid. Together, these components enable DRIVE to perform offline bidding robustly and effectively, as demonstrated by extensive experiments across multiple settings.

Our main contributions are summarized as follows:

- We propose DRIVE, a unified Transformer-based framework for auto-bidding that integrates distributional action modeling, retrieval-augmented candidate generation, and value-based evaluation.
- Extensive experiments are conducted on AuctionNet (Su et al., [2024](https://arxiv.org/html/2606.14192#bib.bib36)), a representative offline bidding benchmark, demonstrating the effectiveness of DRIVE in auto-bidding scenarios.
- We further demonstrate that DRIVE is broadly applicable by seamlessly integrating it into multiple DT-style methods and consistently improving performance across a range of offline RL benchmarks.

## 2 Related Work

### 2.1 Evolution of Auto-bidding Strategies

Early research in auto-bidding primarily relied on static optimization or control-theoretic frameworks. Heuristic bidding strategies, ranging from linear (Perlich et al., [2012](https://arxiv.org/html/2606.14192#bib.bib17)) to non-linear functions (Zhang et al., [2014](https://arxiv.org/html/2606.14192#bib.bib18)), derive bid prices based only on the predicted value of each impression, such as predicted click-through rate (pCTR). To account for budget constraints, control-based methods, including PID controllers (Chen et al., [2011](https://arxiv.org/html/2606.14192#bib.bib19); Lee et al., [2013](https://arxiv.org/html/2606.14192#bib.bib24); Yang et al., [2019](https://arxiv.org/html/2606.14192#bib.bib26)) and Smart Pacing (Xu et al., [2015](https://arxiv.org/html/2606.14192#bib.bib25)), were developed to smooth consumption. However, these approaches are inherently myopic: they focus on immediate returns or predefined heuristic rules, failing to optimize for long-term objectives in the highly stochastic auction environment.

To overcome the myopia of static strategies, reinforcement learning (RL) was introduced to model bidding as a sequential decision process. Cai et al. ([2017](https://arxiv.org/html/2606.14192#bib.bib13)) pioneered a model-based framework by casting bidding as a constrained Markov decision process (MDP) (Puterman, [1990](https://arxiv.org/html/2606.14192#bib.bib43)). However, approaches that rely on explicit environment modeling often incur substantial computational overhead and suffer from simulation-to-reality discrepancies (Wu et al., [2018](https://arxiv.org/html/2606.14192#bib.bib8)). As a result, subsequent research shifted toward model-free RL paradigms (Wu et al., [2018](https://arxiv.org/html/2606.14192#bib.bib8)). Notably, Liu et al. ([2020](https://arxiv.org/html/2606.14192#bib.bib16)) proposed a dynamic strategy leveraging the TD3 algorithm (Fujimoto et al., [2018](https://arxiv.org/html/2606.14192#bib.bib27)) to optimize continuous bidding factors directly, bypassing the need for complex market modeling. In real-world bidding systems, however, online RL is generally impractical, as exploratory actions may incur substantial financial costs. Consequently, offline RL (Levine et al., [2020](https://arxiv.org/html/2606.14192#bib.bib28)), which learns policies solely from logged historical data, has emerged as a practical and dominant paradigm for auto-bidding.

### 2.2 Offline Reinforcement Learning for Auto-bidding

Despite its practical appeal, offline RL introduces fundamental challenges in auto-bidding scenarios. A central issue is distribution shift, where learned policies may exploit actions that are poorly supported by the logged data, leading to unreliable value estimation and unsafe decisions. To address this issue, prior work has proposed conservative or in-sample learning methods, including BCQ (Fujimoto et al., [2019](https://arxiv.org/html/2606.14192#bib.bib20)), CQL (Kumar et al., [2020](https://arxiv.org/html/2606.14192#bib.bib21)), and IQL (Kostrikov et al., [2022](https://arxiv.org/html/2606.14192#bib.bib22)), which aim to mitigate overestimation on out-of-distribution (OOD) actions (Levine et al., [2020](https://arxiv.org/html/2606.14192#bib.bib28)).

While the above value-based offline RL methods are effective at addressing OOD overestimation, they often struggle with long-horizon credit assignment and complex sequential dependencies. This limitation has motivated a paradigm shift toward reformulating RL as generative sequence modeling (Janner et al., [2021](https://arxiv.org/html/2606.14192#bib.bib29); Chen et al., [2021](https://arxiv.org/html/2606.14192#bib.bib23)). Notably, Decision Transformer (DT) (Chen et al., [2021](https://arxiv.org/html/2606.14192#bib.bib23)) leverages the self-attention mechanism to generate actions conditioned on desired future returns, effectively capturing long-range dependencies. Building on this framework, recent methods have sought to integrate value information into generative policies. For example, GAVE (Gao et al., [2025](https://arxiv.org/html/2606.14192#bib.bib12)) introduces value-guided exploration during training, while GAS (Li et al., [2025](https://arxiv.org/html/2606.14192#bib.bib11)) employs post-training search with multi-critic voting to refine actions. Peak-Return Greedy Slicing (Xu et al., [2026](https://arxiv.org/html/2606.14192#bib.bib50)) offers a data-centric paradigm for improving Transformer-based offline RL, where high-return subtrajectories are selected to construct more informative training sequences. Beyond Transformer-based models, DiffBid (Guo et al., [2024](https://arxiv.org/html/2606.14192#bib.bib15)) employs conditional diffusion to model bidding distributions. However, aside from the prohibitive inference latency caused by iterative sampling, it struggles to effectively learn the reverse diffusion process in highly dynamic and long-horizon environments, leading to inaccurate trajectory prediction and suboptimal policy performance.

Despite these advances, DT-style generative approaches remain limited in real-world bidding scenarios. They typically rely on unimodal regression objectives and point-estimate decoding, which fail to capture the inherently multimodal nature of optimal bidding behaviors. As a result, multiple distinct yet effective bidding strategies are often collapsed into averaged actions, leading to suboptimal performance. In contrast, DRIVE explicitly models the action distribution to preserve diverse bidding modes, enabling more robust and effective decision-making under complex and uncertain market conditions.

### 2.3 Retrieval-Augmented Decision Making

Retrieval-Augmented Generation (RAG) was introduced in natural language processing (NLP) to mitigate hallucinations and outdated knowledge in parametric models by incorporating evidence retrieved from large external corpora (Lewis et al., [2020](https://arxiv.org/html/2606.14192#bib.bib30); Guu et al., [2020](https://arxiv.org/html/2606.14192#bib.bib33); Borgeaud et al., [2022](https://arxiv.org/html/2606.14192#bib.bib34)). By grounding generation in retrieved documents, RAG improves both factual accuracy and interpretability in knowledge-intensive tasks like open-domain question answering. Motivated by these benefits, retrieval mechanisms have recently been adopted in RL to better exploit past experience. DT-Mem (Kang et al., [2024](https://arxiv.org/html/2606.14192#bib.bib31)) augments Decision Transformers with an internal memory to reduce forgetting in multi-task settings, while RA-DT (Schmied et al., [2024](https://arxiv.org/html/2606.14192#bib.bib32)) retrieves relevant subtrajectories from an external index to extend context length for long-horizon decision making. These studies suggest that retrieval can serve as an explicit non-parametric component, enhancing decision quality by reusing high-quality historical experiences. Inspired by this line of work, our method leverages retrieval to enhance the robustness of bidding decisions under sparse and long-tail data regimes.

## 3 Preliminaries

### 3.1 RTB Environment and Optimal Bidding

Consider an advertising campaign consisting of $N$ sequential impression opportunities in a real-time bidding (RTB) environment with a generalized second-price (GSP) auction mechanism (Lucier et al., [2012](https://arxiv.org/html/2606.14192#bib.bib44)). For each impression $i$, the advertiser submits a bid $b_i$. The winning outcome of the auction is represented by a binary indicator $x_i \in \{0,1\}$, and the corresponding payment is denoted by $c_i$, which equals the second-highest bid. Each impression is associated with a value $v_i$, such as a click or conversion. The advertiser aims to maximize the total accumulated value $\sum_{i=1}^{N} v_i x_i$ subject to a total budget constraint $B$ and a set of key performance indicator (KPI) constraints, such as cost-per-action (CPA) or return on investment (ROI). This objective can be formulated as the following constrained optimization problem:

$$
\begin{aligned}
\max_{\{x_i\}_{i=1}^{N}} \quad & \sum_{i=1}^{N} v_i x_i \\
\text{s.t.} \quad & \sum_{i=1}^{N} c_i x_i \leq B, \\
& \mathcal{G}_j(x_{1:N}) \leq \mathcal{K}_j, \quad \forall j,
\end{aligned}
\tag{1}
$$

where $\mathcal{G}_j(\cdot)$ denotes the constraint function corresponding to the $j$-th KPI, and $\mathcal{K}_j$ specifies its target threshold. Previous studies (He et al., [2021](https://arxiv.org/html/2606.14192#bib.bib7); Zhang et al., [2014](https://arxiv.org/html/2606.14192#bib.bib18)) have shown that, under mild assumptions, the optimal bidding strategy can be derived from the Karush–Kuhn–Tucker (KKT) conditions and admits a unified affine form. In practice, this strategy is often simplified to a scaled value-based bidding rule:

$$
b_i^{*} = \lambda\, v_i,
\tag{2}
$$

where $\lambda$ is a control parameter determined by the Lagrange multiplier associated with the budget and KPI constraints. As a result, modern auto-bidding methods commonly focus on dynamically adjusting $\lambda$ to adapt to stochastic market conditions and evolving constraints.
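To make the role of $\lambda$ in Equation (2) concrete, the following is a minimal sketch of one bidding interval. It assumes a single-slot second-price auction in which the advertiser wins when its bid exceeds the highest competing bid and then pays that competing bid; the function name `run_interval`, the `market_prices` input, and the rule of skipping impressions that would exceed the budget are illustrative assumptions, not part of the paper.

```python
def run_interval(lam, values, market_prices, budget):
    """Bid lam * v_i on each impression (Equation 2) and tally value and spend.

    Assumption: market_prices[i] is the highest competing bid for impression i,
    which is also the payment c_i when the advertiser wins.
    """
    spent, gained = 0.0, 0.0
    for v, price in zip(values, market_prices):
        bid = lam * v
        # Win only if the bid beats the competition and the budget still covers the payment.
        if bid > price and spent + price <= budget:
            spent += price   # c_i * x_i term of the budget constraint
            gained += v      # v_i * x_i term of the objective
    return gained, spent


values = [1.0, 2.0, 3.0]
market_prices = [0.5, 3.0, 2.0]
conservative = run_interval(1.0, values, market_prices, budget=10.0)
aggressive = run_interval(2.0, values, market_prices, budget=10.0)
```

A larger $\lambda$ wins more impressions but spends the budget faster, which is the trade-off that the sequential policy in the following sections has to manage.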

### 3.2 Sequential Decision-Making for Auto-bidding

In this paper, the auto-bidding problem is formulated as a sequential decision-making task modeled by a Markov Decision Process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$. An episode corresponds to a bidding cycle, typically one day, which is discretized into $T$ time steps. At each time step $t$, the agent determines a bidding control decision $a_t \in \mathcal{A}$ that applies to all impressions arriving within the corresponding interval. The policy depends on the current state $s_t \in \mathcal{S}$, expressed as $\pi(a_t \mid s_t)$. State transitions follow the environment dynamics $\mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$, and upon transitioning to the next state $s_{t+1}$, the environment provides a scalar reward $r_t$ reflecting the performance contribution achieved during time step $t$. $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in (0,1]$ is the discount factor.

State. The state $s_t$ summarizes the contextual information available at time step $t$ from both campaign-level and market-level perspectives. Campaign-level features characterize the internal status and global constraints of the advertiser. Market-level features capture the external auction environment and its temporal dynamics, which are derived from aggregated historical auction observations.

Action. The action $a_t$ specifies a bid adjustment factor, namely the value $\lambda_t$ taken at time step $t$ by the bidding parameter $\lambda$ in Equation ([2](https://arxiv.org/html/2606.14192#S3.E2)). This parameter scales the predicted value $v$ of each impression, which is assumed to be available from a pre-trained prediction model, to compute the final bid price.

Reward. The reward $r_t$ measures the contribution of the agent’s decision $a_t$ at time step $t$ to the advertiser’s objective. Typical reward definitions include the total conversion value or the number of clicks obtained during the corresponding interval, optionally combined with penalty terms to reflect budget or KPI violations.

To leverage the sequence modeling capabilities of Transformer-based offline RL methods, such as DT, the offline dataset is organized into trajectories of the form $\tau = (\hat{R}_0, s_0, a_0, \dots, \hat{R}_T, s_T, a_T)$. Here, $\hat{R}_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ denotes the return-to-go (RTG), which conditions action generation on desired future performance. During training, the Transformer learns to predict actions conditioned on the RTG-state context from offline trajectories. At inference time, the policy generates actions autoregressively based on the current state and target RTG, enabling long-horizon credit assignment through sequence modeling.
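The return-to-go defined above can be computed for a logged episode with a single backward pass over the rewards. The sketch below is illustrative only; the function name `returns_to_go` and the list-based representation of an episode are assumptions made for this example.

```python
def returns_to_go(rewards, gamma=1.0):
    """Compute R_hat_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every time step t."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted return of its successor.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


undiscounted = returns_to_go([1.0, 2.0, 3.0])
discounted = returns_to_go([1.0, 2.0, 3.0], gamma=0.5)
```

Each RTG value is then interleaved with the corresponding state and action to form the trajectory $\tau$ that the Transformer consumes.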

![Refer to caption](https://arxiv.org/html/2606.14192v1/x2.png)

Figure 2: Real failure cases in AuctionNet. The blue line denotes the suboptimal predicted average action, the red line the optimal action, and colored points indicate dataset actions with color intensity reflecting RTG.

## 4 Methodology

In this section, we present DRIVE, a unified Transformer-based framework for auto-bidding. DRIVE extends standard Transformer-based offline RL by incorporating three key components: (I) a distributional action head that captures multimodal bidding behaviors, (II) a retrieval-augmented mechanism that grounds decisions in relevant high-quality historical trajectories, and (III) a value-based critic that selects the final action to improve robustness. For clarity and completeness, we provide a comprehensive table of notations in Appendix [A.1](https://arxiv.org/html/2606.14192#A1.SS1) and the detailed algorithmic workflow in Appendix [A.2](https://arxiv.org/html/2606.14192#A1.SS2).

![Refer to caption](https://arxiv.org/html/2606.14192v1/x3.png)

Figure 3: The DRIVE framework. DRIVE is built upon a Transformer-based offline RL paradigm, incorporating (I) a multimodal GMM-based policy, (II) a retrieval index over contextual state embeddings, and (III) a value-based offline critic. At inference time, generated and retrieved actions are jointly evaluated, and the final action is selected via critic-based ranking.

### 4.1 GMM-Based Action Generation

Transformer\-based offline RL models generate actions via conditional sequence modeling:

$$
a_t \sim \pi(a_t \mid \tau_{0:t-1}, \hat{R}_t, s_t).
\tag{3}
$$

Most existing Transformer-based approaches in continuous action spaces adopt a deterministic regression head optimized with a mean squared error (MSE) objective (Wang and Bovik, [2009](https://arxiv.org/html/2606.14192#bib.bib45)). Such unimodal regression tends to average over diverse historical actions, as illustrated in Figure [2](https://arxiv.org/html/2606.14192#S3.F2) with real-world examples from AuctionNet. This issue is particularly pronounced in bidding environments, where conservative and aggressive strategies coexist, often resulting in collapsed and non-informative actions.

To explicitly capture multimodal bidding behaviors, we replace the deterministic action head with a Gaussian Mixture Model (GMM) head (Reynolds, [2018](https://arxiv.org/html/2606.14192#bib.bib42)). Following the Mixture Density Network paradigm (Bishop, [1994](https://arxiv.org/html/2606.14192#bib.bib35)), the GMM head models the conditional action distribution by predicting a set of $M$ mixture components:

$$
\big\{\alpha_m, \mu_m, \sigma_m^2\big\}_{m=1}^{M}.
\tag{4}
$$

Here, $\alpha_m \in [0,1]$ denotes the mixing coefficient of the $m$-th component, satisfying $\sum_{m=1}^{M} \alpha_m = 1$, while $\mu_m$ and $\sigma_m^2$ represent the mean and variance of the corresponding Gaussian component, respectively. All mixture parameters are dynamically predicted conditioned on the current trajectory context. The resulting action distribution is then given by:

$$
P(a_t \mid \tau_{0:t-1}, \hat{R}_t, s_t) = \sum_{m=1}^{M} \alpha_m\, \mathcal{N}(a_t \mid \mu_m, \sigma_m^2),
\tag{5}
$$

which forms a multi-peaked density capable of representing distinct bidding modes. This GMM-based variant is trained by maximizing the log-likelihood of historical actions from the offline dataset $\mathcal{D}$, that is, by minimizing the negative log-likelihood:

$$
\mathcal{L}_{\mathrm{GMM}} = -\mathbb{E}_{\tau \sim \mathcal{D}} \Bigg[ \sum_{t=1}^{T} \log \Big( \sum_{m=1}^{M} \alpha_m\, \mathcal{N}(a_t \mid \mu_m, \sigma_m^2) \Big) \Bigg].
\tag{6}
$$

This distributional objective enables the policy to represent multiple bidding strategies simultaneously, rather than collapsing them into a single point estimate.
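The per-step term inside Equation (6) can be evaluated stably with a log-sum-exp over components. The sketch below is a plain-Python illustration for scalar actions, not the authors' implementation: it takes standard deviations rather than variances, assumes strictly positive mixing coefficients (as a softmax output would give), and averages over samples instead of summing over a trajectory. The toy data at the end are made up to mimic two coexisting bidding modes.

```python
import math


def gmm_log_prob(a, alphas, mus, sigmas):
    """Log-density of scalar action a under a 1-D Gaussian mixture (Equation 5)."""
    log_terms = [
        math.log(alpha) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((a - mu) / sigma) ** 2
        for alpha, mu, sigma in zip(alphas, mus, sigmas)
    ]
    # Log-sum-exp trick: subtract the largest term before exponentiating.
    top = max(log_terms)
    return top + math.log(sum(math.exp(x - top) for x in log_terms))


def gmm_nll(actions, params):
    """Average negative log-likelihood; params[t] = (alphas, mus, sigmas) for step t."""
    total = sum(gmm_log_prob(a, *p) for a, p in zip(actions, params))
    return -total / len(actions)


# Hypothetical logged actions from two modes: a low bid factor and a high one.
actions = [0.5, 1.5]
single_gaussian = ([1.0], [1.0], [0.5])             # one component centred on the average
two_modes = ([0.5, 0.5], [0.5, 1.5], [0.1, 0.1])    # one component per mode
nll_single = gmm_nll(actions, [single_gaussian] * 2)
nll_mixture = gmm_nll(actions, [two_modes] * 2)
```

On this toy data the two-component mixture attains a lower negative log-likelihood than the single Gaussian centred on the averaged action, which is the behavior the distributional head is meant to exploit.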

Inference-Time Sampling. Unlike deterministic policies that output a single action, the GMM-based action head enables stochastic sampling at inference time. Given the predicted mixture parameters, a set of candidate actions $\mathcal{A}_{\text{gen}} = \{a_t^{(l)}\}_{l=1}^{L}$ is generated by sampling from the learned mixture distribution:

$$
a_t^{(l)} \sim \sum_{m=1}^{M} \alpha_m\, \mathcal{N}(\mu_m, \sigma_m^2).
\tag{7}
$$

This sampling mechanism preserves multiple plausible bidding modes and yields a diverse candidate pool for subsequent evaluation.
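Sampling from the mixture in Equation (7) is commonly done in two stages: pick a component according to the mixing coefficients, then draw from that component's Gaussian. The following is a small sketch of that procedure using only the standard library; `sample_gmm`, its arguments, and the example mixture parameters are hypothetical, and standard deviations are passed instead of variances.

```python
import random


def sample_gmm(alphas, mus, sigmas, num_samples, rng=None):
    """Draw num_samples scalar actions from a 1-D Gaussian mixture."""
    rng = rng or random.Random()
    # Stage 1: choose a component index for each sample, weighted by alpha_m.
    components = rng.choices(range(len(alphas)), weights=alphas, k=num_samples)
    # Stage 2: sample from the chosen component's Gaussian.
    return [rng.gauss(mus[m], sigmas[m]) for m in components]


# Hypothetical two-mode policy output: a conservative and an aggressive bid factor.
candidates = sample_gmm([0.5, 0.5], [0.5, 1.5], [0.01, 0.01],
                        num_samples=200, rng=random.Random(0))
```

Because every sample stays near one of the component means rather than their average, the candidate pool retains both modes for the critic to compare.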

### 4.2 Retrieval-Augmented Candidate Generation

Retrieval-augmented generation (Lewis et al., [2020](https://arxiv.org/html/2606.14192#bib.bib30)) improves robustness under sparse inputs by grounding parametric models with high-quality samples from the training data. Motivated by this property, we incorporate a retrieval mechanism into the decision-making process to augment the parametric Transformer-based policy with relevant examples drawn from the offline dataset. This design enables the policy to anchor its decisions to previously observed high-performing behaviors, particularly in sparse and long-tail bidding scenarios. Specifically, the GMM-based Transformer encoder is first used to encode the offline dataset into contextual state embeddings, and at inference time, high-quality actions are retrieved based on state similarity to the current decision context.

Retrieval Index Construction. We construct a retrieval index $\mathcal{I}$ from the offline training dataset using contextual state embeddings. While the pre-trained policy encoder can be reused to minimize overhead, for large-scale industrial tasks, we employ a dedicated lightweight Transformer encoder. This design reduces embedding dimensions for efficient search without compromising policy capacity. Rather than storing raw states, each trajectory in the offline dataset $\mathcal{D}$ is processed by the encoder to obtain a contextual embedding at every time step:

$$
h_t = f_{\text{enc}}(\tau_{0:t-1}, \hat{R}_t, s_t) \in \mathbb{R}^{d},
\tag{8}
$$

where $f_{\text{enc}}(\cdot)$ denotes the contextual encoder, and $h_t$ corresponds to the hidden representation of the state before the action head. This embedding captures both temporal dependencies and semantic context.

To improve retrieval efficiency and candidate quality, optional lightweight filtering can be applied during index construction to remove low-quality or uninformative transitions, depending on practical requirements. For each retained transition, the contextual embedding $h_t$ is used as the retrieval key, while the corresponding action $a_t$ and return-to-go $\hat{R}_t$ are stored as aligned values. Similarity search is performed directly in this embedding space. Additional implementation details of the retrieval module are in Appendix [A.3](https://arxiv.org/html/2606.14192#A1.SS3).

Inference-Time Retrieval. At inference time, the current decision context at time step $t$ is encoded by the Transformer encoder to obtain the contextual state embedding $h_t$, as defined in Equation ([8](https://arxiv.org/html/2606.14192#S4.E8)). To retrieve actions that are both contextually relevant and high-performing, we adopt a retrieve-then-filter strategy. First, a candidate pool of $K_{\text{pool}}$ nearest neighbors is retrieved from the index $\mathcal{I}$ based on cosine similarity (Xia et al., [2015](https://arxiv.org/html/2606.14192#bib.bib46)):

$$
\mathcal{C}_{\text{pool}} = \big\{ (a_k, \hat{R}_k) \;\big|\; k \in \text{Top-}K_{\text{pool}}^{\mathrm{sim}}(\mathcal{I}, h_t) \big\},
\tag{9}
$$

where $\text{Top-}K_{\text{pool}}^{\mathrm{sim}}$ returns the indices of the $K_{\text{pool}}$ entries in $\mathcal{I}$ with the highest cosine similarity to $h_t$. The retrieved candidates are then ranked according to their stored RTG values, and the top-$K$ actions are selected:

$$
\mathcal{A}_{\text{ret}} = \big\{ a_k \;\big|\; k \in \text{Top-}K^{\hat{R}}(\mathcal{C}_{\text{pool}}) \big\}.
\tag{10}
$$

Here, $\text{Top-}K^{\hat{R}}$ returns the indices of the $K$ candidates with the highest stored RTG values. These retrieved actions complement the generated candidates by providing high-quality references from the training data, improving decision robustness under sparse and long-tail bidding conditions.
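The retrieve-then-filter strategy of Equations (9) and (10) can be sketched with a brute-force search over a small in-memory index. This is an illustration under simplifying assumptions: the index is a plain list of `(embedding, action, rtg)` tuples, similarity is computed exhaustively, and the toy entries are invented. A large-scale deployment would presumably rely on an approximate nearest-neighbor index instead, which the main text does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_then_filter(index, query, k_pool, k):
    """index: list of (embedding, action, rtg). Returns the K retrieved actions."""
    # Equation (9): keep the K_pool entries most similar to the query embedding h_t.
    pool = sorted(index, key=lambda e: cosine(e[0], query), reverse=True)[:k_pool]
    # Equation (10): within the pool, keep the K entries with the highest stored RTG.
    best = sorted(pool, key=lambda e: e[2], reverse=True)[:k]
    return [action for _, action, _ in best]


# Toy index: the last entry has a very high RTG but comes from a dissimilar context.
index = [
    ([1.0, 0.0], 0.8, 10.0),
    ([0.9, 0.1], 1.2, 30.0),
    ([0.8, 0.2], 1.0, 20.0),
    ([0.0, 1.0], 5.0, 99.0),
]
retrieved = retrieve_then_filter(index, query=[1.0, 0.0], k_pool=3, k=2)
```

Filtering by similarity first keeps the high-RTG action from an unrelated context out of the candidate set, so RTG ranking only chooses among contextually relevant decisions.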

### 4.3 Value-Based Action Evaluation

While the GMM head provides diverse probabilistic candidates and the retrieval module supplies high-quality references from the dataset, relying on either source alone can be risky. Generative candidates capture multimodal bidding behaviors but may suffer from model uncertainty, whereas retrieved actions offer stability but can be suboptimal when the current context differs from past observations. To robustly select the final bid, we introduce a value-based critic to evaluate all candidate actions before execution.

A broad class of value-based offline RL methods can be applied to train the critic for evaluating candidate actions and determining the final bid. In this work, the critic follows the Implicit Q-Learning (IQL) paradigm (Kostrikov et al., [2022](https://arxiv.org/html/2606.14192#bib.bib22)), which estimates action values strictly within the support of the offline dataset without explicitly penalizing unseen actions. Specifically, two Q-functions and a state value function are learned from offline data. The state value function is trained to approximate an upper expectile of the Q-value distribution, enabling implicit maximization over in-sample actions. This is achieved via expectile regression (Jiang et al., [2017](https://arxiv.org/html/2606.14192#bib.bib47)):

$$
\mathcal{L}_{V} = \mathbb{E}_{(s,a) \sim \mathcal{D}} \Big[ L_2^{\eta} \big( \min_{i=1,2} Q_i(s,a) - V(s) \big) \Big],
\tag{11}
$$

where $L_2^{\eta}(u) = |\eta - \mathbb{I}(u < 0)|\, u^2$ and $\eta \in (0.5, 1)$ controls the degree of implicit maximization. The Q-functions are then trained using a Bellman target constructed from the learned state value:

$$\mathcal{L}_{Q}=\mathbb{E}_{(s,a,r,s^{\prime})\sim\mathcal{D}}\Big[\big(Q(s,a)-(r+\gamma V(s^{\prime}))\big)^{2}\Big], \tag{12}$$

which ensures strictly in-sample learning and provides a stable critic for offline evaluation.
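
The two critic objectives can be written down directly from Equations (11) and (12). The following is a minimal sketch in plain Python rather than the authors' training code: the critics are passed in as ordinary callables, the values `eta=0.7` and `gamma=0.99` are assumed defaults, and terminal-state handling is omitted.

```python
def expectile_loss(u, eta):
    """Asymmetric squared loss L_2^eta(u) = |eta - 1(u < 0)| * u^2."""
    weight = abs(eta - (1.0 if u < 0 else 0.0))
    return weight * u * u

def value_loss(batch, q1, q2, v, eta=0.7):
    """Eq. (11): mean expectile loss of min(Q1, Q2)(s, a) - V(s)."""
    total = 0.0
    for s, a, _r, _s_next in batch:
        u = min(q1(s, a), q2(s, a)) - v(s)
        total += expectile_loss(u, eta)
    return total / len(batch)

def q_loss(batch, q, v, gamma=0.99):
    """Eq. (12): mean squared error against the target r + gamma * V(s')."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * v(s_next)
        total += (q(s, a) - target) ** 2
    return total / len(batch)

# Toy critics and a single transition (s, a, r, s'); all values are hypothetical.
q1 = lambda s, a: s + a
q2 = lambda s, a: s + 0.5 * a
v = lambda s: s
batch = [(1.0, 2.0, 0.5, 2.0)]

loss_v = value_loss(batch, q1, q2, v, eta=0.7)
loss_q = q_loss(batch, q1, v, gamma=0.99)
```

Because $\eta>0.5$, positive residuals (where the Q-value exceeds $V(s)$) are weighted more heavily than negative ones, which pushes $V(s)$ toward the upper range of in-sample Q-values.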

Crucially, the critic adapts to task requirements by shaping the reward to reflect safety constraints. In unconstrained settings, the raw reward $r$ is used, whereas for constrained tasks it is replaced in Equation ([12](https://arxiv.org/html/2606.14192#S4.E12)) with a constraint-aware reward $r^{\prime}$. Specifically, for CPA-constrained tasks, the shaped reward is defined as:

$$r^{\prime}=r\times\min\bigg(1,\Big(\frac{\mathcal{K}}{C+\epsilon}\Big)^{\beta}\bigg), \tag{13}$$

where $\mathcal{K}$ denotes the target CPA threshold, $C$ represents the realized CPA, and $\beta=2$ controls the penalty steepness. This shaping ensures the learned value landscape inherently reflects safety, guiding the agent toward feasible regions.
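
A short worked example makes the effect of the penalty in Equation (13) concrete. The sketch below assumes that $\epsilon$ is a small constant guarding against division by zero (the text does not define it), and the CPA values are hypothetical.

```python
def shaped_reward(r, realized_cpa, target_cpa, beta=2.0, eps=1e-8):
    """Eq. (13): r' = r * min(1, (K / (C + eps))^beta).

    eps is assumed to be a small stabilizing constant.
    """
    penalty = min(1.0, (target_cpa / (realized_cpa + eps)) ** beta)
    return r * penalty

# Realized CPA below the target: no penalty, the reward is unchanged.
within = shaped_reward(r=100.0, realized_cpa=8.0, target_cpa=10.0)

# Realized CPA at twice the target: the reward is scaled by (1/2)^2 = 0.25.
violated = shaped_reward(r=100.0, realized_cpa=20.0, target_cpa=10.0)
```

With $\beta=2$ the penalty grows quadratically in the violation ratio, so exceeding the target CPA by a factor of two keeps only about a quarter of the raw reward.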

Inference-Time Decision-Making. At inference time, DRIVE generates the final decision by invoking the above three modules. Specifically, it first samples $L$ candidate actions $\mathcal{A}_{\text{gen}}$ from the GMM-based policy to cover diverse bidding modes. In parallel, a set of $K$ high-quality actions $\mathcal{A}_{\text{ret}}$ is retrieved using the RTG-guided retrieval strategy described earlier. The two sets are then combined into a unified candidate pool $\mathcal{A}_{\text{cand}}=\mathcal{A}_{\text{gen}}\cup\mathcal{A}_{\text{ret}}$. The final action is selected by evaluating all candidates with the learned critic:

$$a^{*}=\arg\max_{a\in\mathcal{A}_{\text{cand}}}\min_{i=1,2}Q_{i}(s,a). \tag{14}$$

This inference procedure integrates the diversity of generative candidates, the reliability of retrieved actions, and value-based evaluation to produce robust decisions. Through this design, DRIVE can be integrated into other Transformer-based offline RL algorithms, providing a general solution to the average-action issue and unreliable decisions under long-tail and sparse data regimes, particularly in auto-bidding.
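
The selection rule in Equation (14) amounts to a union of the two candidate sets followed by an argmax over the pessimistic (minimum) Q-value. The sketch below is illustrative only: the function name `select_bid`, the scalar actions, and the two toy critics are assumptions, not the paper's implementation.

```python
def select_bid(state, generated, retrieved, q1, q2):
    """Eq. (14): argmax over A_gen ∪ A_ret of min(Q1, Q2)(s, a)."""
    # Union of the two candidate sets, keeping the first occurrence of duplicates.
    candidates = list(generated) + [a for a in retrieved if a not in generated]
    return max(candidates, key=lambda a: min(q1(state, a), q2(state, a)))

# Hypothetical critics that peak near a bid of about 1.0 to 1.2.
q1 = lambda s, a: -(a - 1.2) ** 2
q2 = lambda s, a: -(a - 1.0) ** 2 + 0.1

generated = [0.5, 2.0]   # samples from the GMM policy (illustrative)
retrieved = [1.1, 2.0]   # RTG-guided retrieved actions (illustrative)
best = select_bid(state=None, generated=generated, retrieved=retrieved, q1=q1, q2=q2)
```

In this toy case neither generated sample is close to the high-value region, and the retrieved bid 1.1 is selected, which mirrors the role of retrieval as a non-parametric fallback.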

Table 1: Comparison with baselines on AuctionNet and AuctionNet-Sparse datasets under different budget constraints. We report values (mean ± standard deviation) over 10 seeds. The best results are bolded, and the second-best results are underlined.

| Dataset | Budget | CQL | IQL | BCQ | DiffBid | DT | CDT | GAS | GAVE | GAVE-S | DRIVE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AuctionNet | 50% | 212 ± 3.06 | 194 ± 2.07 | 181 ± 3.26 | 155 ± 2.59 | 208 ± 1.75 | 208 ± 2.06 | 200 ± 2.68 | 133 ± 1.69 | 108 ± 1.36 | **212 ± 1.57** |
| AuctionNet | 75% | 300 ± 2.65 | 284 ± 1.76 | 263 ± 3.39 | 225 ± 1.47 | 298 ± 2.21 | 300 ± 2.00 | 295 ± 3.33 | 192 ± 1.74 | 158 ± 1.49 | 297 ± 2.25 |
| AuctionNet | 100% | 382 ± 2.25 | 366 ± 1.82 | 343 ± 8.2 | 285 ± 1.85 | 373 ± 3.18 | 382 ± 3.19 | 381 ± 2.71 | 245 ± 1.00 | 209 ± 2.39 | **399 ± 3.74** |
| AuctionNet | 125% | 463 ± 2.61 | 444 ± 2.30 | 414 ± 12.34 | 334 ± 2.80 | 430 ± 2.98 | 450 ± 2.68 | 457 ± 3.07 | 298 ± 2.05 | 261 ± 2.52 | **475 ± 5.33** |
| AuctionNet | 150% | 535 ± 2.97 | 500 ± 2.57 | 478 ± 8.65 | 377 ± 3.05 | 477 ± 2.12 | 508 ± 2.67 | 525 ± 3.24 | 350 ± 2.10 | 316 ± 2.61 | **551 ± 4.64** |
| AuctionNet | Average | 378.4 | 357.6 | 335.8 | 275.2 | 357.2 | 369.6 | 371.6 | 243.6 | 210.4 | 386.6 |
| AuctionNet-Sparse | 50% | 20.2 ± 0.69 | 17.9 ± 0.63 | 17.9 ± 0.38 | 14.9 ± 0.60 | 15.8 ± 0.57 | 17.8 ± 0.65 | 14.2 ± 0.68 | 8.7 ± 0.36 | 17.6 ± 0.66 | **20.4 ± 0.44** |
| AuctionNet-Sparse | 75% | 28.8 ± 0.72 | 26.9 ± 0.66 | 26.7 ± 0.62 | 20.2 ± 0.68 | 23.1 ± 0.22 | 26.9 ± 0.53 | 20.8 ± 0.84 | 9.8 ± 0.41 | 25.7 ± 1.09 | 27.8 ± 0.53 |
| AuctionNet-Sparse | 100% | 37.1 ± 0.59 | 35.2 ± 1.09 | 34.2 ± 0.76 | 24.3 ± 0.54 | 30.6 ± 0.69 | 35.9 ± 0.68 | 27.1 ± 0.83 | 9.9 ± 0.43 | 34.3 ± 1.04 | **37.3 ± 0.87** |
| AuctionNet-Sparse | 125% | 44.6 ± 1.12 | 43.7 ± 0.77 | 42.0 ± 1.05 | 28.2 ± 0.75 | 37.9 ± 0.54 | 44.1 ± 1.09 | 33.1 ± 0.67 | 9.9 ± 0.45 | 41.3 ± 1.14 | 43.1 ± 1.18 |
| AuctionNet-Sparse | 150% | 49.6 ± 1.27 | 51.4 ± 0.97 | 47.7 ± 0.88 | 31.4 ± 0.61 | 45.7 ± 0.89 | 50.6 ± 1.61 | 40.2 ± 0.83 | 10.0 ± 0.46 | 49.2 ± 0.65 | **51.8 ± 0.53** |
| AuctionNet-Sparse | Average | 36.06 | 35.02 | 33.7 | 23.8 | 30.62 | 33.06 | 27.08 | 9.66 | 33.62 | 36.08 |

Table 2: Comparison with baselines on D4RL benchmarks. We report normalized scores over 5 seeds. The best results are bolded, and the second-best results are underlined. The results for baselines are taken from original papers.

| Domain | Dataset | CQL | IQL | BEAR | TD3+BC | BC | DT | PDiT | DRIVE |
|---|---|---|---|---|---|---|---|---|---|
| Gym MuJoCo | halfcheetah-expert | 62.4 | 86.7 | 53.4 | 90.7 | 86.2 ± 9.4 | 91.7 ± 0.3 | 73.0 ± 4.3 | 93.0 ± 0.9 |
| Gym MuJoCo | hopper-expert | 111.0 | 91.5 | 96.3 | 98.0 | 67.5 ± 13.1 | 109.8 ± 0.5 | 111.4 ± 0.1 | 106.1 ± 1.2 |
| Gym MuJoCo | walker2d-expert | 98.7 | 109.6 | 40.1 | 110.1 | 108.7 ± 0.3 | 108.9 ± 0.1 | 108.8 ± 0.4 | 108.6 ± 0.1 |
| Gym MuJoCo | halfcheetah-medium | 44.4 | 47.4 | 41.7 | 48.4 | 40.5 ± 0.1 | 40.0 ± 0.1 | 42.8 ± 2.3 | 46.7 ± 0.1 |
| Gym MuJoCo | hopper-medium | 58.0 | 66.3 | 52.1 | 59.3 | 59.9 ± 1.5 | 63.6 ± 2.6 | 68.2 ± 2.4 | 67.3 ± 1.5 |
| Gym MuJoCo | walker2d-medium | 79.2 | 78.3 | 59.1 | 83.7 | 78.8 ± 1.1 | 78.1 ± 1.5 | 77.6 ± 0.6 | 81.0 ± 0.1 |
| Gym MuJoCo | halfcheetah-medium-replay | 46.2 | 44.2 | 38.6 | 44.6 | 35.8 ± 0.7 | 35.0 ± 1.0 | 40.8 ± 2.3 | 43.7 ± 0.2 |
| Gym MuJoCo | hopper-medium-replay | 48.6 | 94.7 | 33.7 | 60.9 | 48.0 ± 28.2 | 78.7 ± 0.3 | 89.6 ± 2.7 | 89.2 ± 2.1 |
| Gym MuJoCo | walker2d-medium-replay | 26.7 | 73.9 | 19.2 | 81.8 | 57.5 ± 3.3 | 71.5 ± 1.7 | 74.1 ± 0.6 | 82.0 ± 2.9 |
| Gym MuJoCo | Gym Average | 63.9 | 77.0 | 48.2 | 75.3 | 64.8 | 75.3 | 76.3 | 79.7 |
| Maze2D | maze2d-umaze | 94.7 | 42.1 | 65.7 | 14.8 | 16.4 ± 4.7 | 60.3 ± 7.7 | 73.2 ± 11.6 | 56.3 ± 4.1 |
| Maze2D | maze2d-medium | 41.8 | 34.9 | 25.0 | 62.1 | 20.1 ± 8.4 | 37.0 ± 8.3 | 51.2 ± 4.9 | 136.8 ± 8.6 |
| Maze2D | maze2d-large | 49.6 | 61.7 | 81.0 | 88.6 | 10.3 ± 8.6 | 25.4 ± 4.9 | 40.0 ± 10.2 | 84.4 ± 5.3 |
| Maze2D | Maze2D Average | 62.0 | 46.2 | 57.2 | 55.2 | 15.6 | 40.9 | 54.8 | 92.5 |

## 5Experiment

Extensive experiments are conducted to evaluate the proposed DRIVE framework. The primary objective is to examine its ability to alleviate the “Average Action” issue in multimodal decision landscapes and to improve decision reliability under sparse and long-tail data regimes in auto-bidding scenarios. In addition, representative offline RL tasks are used to assess its general effectiveness and transferability beyond bidding. A comprehensive ablation study further analyzes the contribution of each core component. Finally, DRIVE is integrated into multiple Transformer-based offline RL architectures to demonstrate its plug-and-play generality. Detailed experimental settings and hyperparameters are provided in Appendix [A.4](https://arxiv.org/html/2606.14192#A1.SS4) for reproducibility.

Datasets. The proposed framework is evaluated on *AuctionNet* (Su et al., [2024](https://arxiv.org/html/2606.14192#bib.bib36)) and the *D4RL* benchmark (Fu et al., [2020](https://arxiv.org/html/2606.14192#bib.bib37)). AuctionNet is a large-scale industrial benchmark constructed from real-world bidding logs, featuring highly stochastic dynamics and strict budget and CPA constraints. Both dense and sparse variants are used to assess robustness under challenging bidding conditions. To evaluate generalization, D4RL benchmarks including Gym-MuJoCo and Maze2D are adopted, covering continuous control and long-horizon navigation under varying data quality. Additional dataset statistics and feature descriptions are provided in Appendix [B.1](https://arxiv.org/html/2606.14192#A2.SS1).

Baselines. The baselines are grouped into three categories to cover representative algorithmic paradigms in offline RL and auto-bidding: (1) *Classical Offline RL Methods*, including BCQ (Fujimoto et al., [2019](https://arxiv.org/html/2606.14192#bib.bib20)), CQL (Kumar et al., [2020](https://arxiv.org/html/2606.14192#bib.bib21)), IQL (Kostrikov et al., [2022](https://arxiv.org/html/2606.14192#bib.bib22)), TD3+BC (Fujimoto and Gu, [2021](https://arxiv.org/html/2606.14192#bib.bib38)), and BEAR (Kumar et al., [2019](https://arxiv.org/html/2606.14192#bib.bib40)). These traditional methods provide strong stability guarantees but offer limited expressiveness. (2) *Generative Sequence Modeling Methods*, including Behavior Cloning (BC), Decision Transformer (DT) (Chen et al., [2021](https://arxiv.org/html/2606.14192#bib.bib23)), and PDiT (Mao et al., [2024](https://arxiv.org/html/2606.14192#bib.bib39)). These methods are included to examine the limitations of deterministic or unimodal regression objectives, particularly the “Average Action” phenomenon. (3) *Auction-Specific Baselines*, consisting of constrained optimization methods such as CDT (Liu et al., [2023](https://arxiv.org/html/2606.14192#bib.bib48)), as well as recent generative bidding models including GAS (Li et al., [2025](https://arxiv.org/html/2606.14192#bib.bib11)), GAVE and GAVE-S (Gao et al., [2025](https://arxiv.org/html/2606.14192#bib.bib12)), and the diffusion-based DiffBid (Guo et al., [2024](https://arxiv.org/html/2606.14192#bib.bib15)), which represent advanced generative approaches for auto-bidding. Unless otherwise specified, DRIVE is implemented on a DT backbone.

Evaluation Metrics. On AuctionNet, the primary evaluation metric in the main experiments is the *Value* $\sum r$, which measures unconstrained performance. For constrained bidding tasks, reported in Appendix [C.5](https://arxiv.org/html/2606.14192#A3.SS5), we additionally adopt the *Score*, computed using the shaped reward $\sum r^{\prime}$ defined in Equation ([13](https://arxiv.org/html/2606.14192#S4.E13)), to explicitly reflect constraint satisfaction. On D4RL, the standard *Normalized Score* (Fu et al., [2020](https://arxiv.org/html/2606.14192#bib.bib37)) is reported for fair comparison across tasks.

### 5\.1Overall Performance

This section presents a comprehensive comparison between DRIVE and baseline methods. Table [1](https://arxiv.org/html/2606.14192#S4.T1) reports results on the AuctionNet benchmarks, while Table [2](https://arxiv.org/html/2606.14192#S4.T2) summarizes normalized scores on the D4RL Gym and Maze2D domains. Additional results and extended analyses are provided in Appendix [C](https://arxiv.org/html/2606.14192#A3).

Results on AuctionNet. DRIVE achieves the highest average Value on both AuctionNet (386.6) and AuctionNet-Sparse (36.08), and the best or tied-best result in seven of the ten budget settings; CQL scores slightly higher at the 75% budget on both datasets and at the 125% budget on AuctionNet-Sparse. On the particularly challenging AuctionNet-Sparse benchmark, purely generative baselines such as DT and DiffBid degrade noticeably due to severe data sparsity. In contrast, DRIVE effectively combines generative flexibility with value-based evaluation, leading to robust performance in low-density regimes. By anchoring action selection to retrieved high-quality decisions from the offline dataset, DRIVE mitigates the instability and hallucination issues commonly observed in purely parametric models. A detailed case study illustrating how DRIVE alleviates the Average Action issue by capturing multimodal bidding behaviors through the GMM-based policy is provided in Appendix [C](https://arxiv.org/html/2606.14192#A3).

Results on Gym Tasks. As shown in Table [2](https://arxiv.org/html/2606.14192#S4.T2), DRIVE attains the highest average normalized score (79.7) across the Gym locomotion suite, outperforming both traditional value-based methods and recent generative approaches. While most methods exhibit similar performance on expert datasets, DRIVE shows clear advantages over the sequence-modeling baselines on mixed-quality datasets such as medium and medium-replay. For example, on walker2d-medium-replay, DRIVE scores 82.0 against 71.5 for DT, a relative improvement of 14.7%. These results indicate that the proposed distributional action modeling combined with value-guided selection effectively alleviates the average-action issue in sequence modeling, enabling recovery of high-quality behaviors from suboptimal data.

Results on Maze2D Domain. In the Maze2D domain, DRIVE attains the highest average score (92.5, compared with 62.0 for the next-best method, CQL). On the challenging maze2d-medium task, DRIVE achieves a score of 136.8, exceeding all competing methods by a substantial margin. The strong performance on complex tasks highlights the importance of retrieval-augmented decision making. By grounding actions in retrieved transitions, DRIVE supports stable long-horizon planning and reduces instability in sparse regions. Furthermore, a qualitative analysis is presented in Appendix [C.2](https://arxiv.org/html/2606.14192#A3.SS2).

Table 3: Ablation study on core components.

| Budget (%) | Actor Only (Dominant Mean) | Retr. + Critic (Selection) | Gen. + Critic (No Retr.) | DRIVE (Full) |
|---|---|---|---|---|
| 50 | 196.4 ± 1.45 | 185.4 ± 1.86 | 205.4 ± 3.14 | 211.8 ± 1.57 |
| 75 | 290.1 ± 3.32 | 281.9 ± 3.37 | 296.7 ± 2.39 | 297.0 ± 2.25 |
| 100 | 371.9 ± 2.82 | 370.4 ± 1.58 | 378.4 ± 2.68 | 399.0 ± 3.74 |
| 125 | 450.3 ± 3.85 | 449.1 ± 4.62 | 461.3 ± 6.70 | 475.4 ± 5.33 |
| 150 | 519.8 ± 2.81 | 490.3 ± 1.10 | 532.2 ± 2.16 | 550.6 ± 4.64 |

### 5\.2Ablation Study

Table [3](https://arxiv.org/html/2606.14192#S5.T3) reports a component-wise ablation analysis of the proposed framework. Comparing the deterministic *Actor Only* baseline with the *Gen. + Critic* variant reveals consistent performance gains across all budget settings, indicating that value-guided stochastic sampling is more effective than selecting a single most-likely action. By evaluating multiple sampled candidates, the critic is able to identify high-value actions that do not necessarily correspond to the dominant mixture component, alleviating bias toward the most frequent bidding mode. Further incorporating the retrieval module (full DRIVE) yields additional improvements at every budget, ranging from marginal at 75% (296.7 to 297.0) to substantial at 100% (378.4 to 399.0). This demonstrates that retrieval-based augmentation provides an effective non-parametric correction by anchoring decision making to high-quality examples from the offline dataset. Such anchoring mitigates hallucinated or unreliable actions produced by the parametric policy, particularly near complex decision boundaries and under sparse data conditions.

### 5\.3Comparison of GMM and Diffusion Action Head

To validate the choice of the GMM-based action head, we perform a controlled experiment replacing it with a diffusion-based head (DDPM, T=100 steps) while keeping the Transformer backbone and IQL critic unchanged. Table [4](https://arxiv.org/html/2606.14192#S5.T4) reports the average performance across budgets. GMM captures multimodal actions via a closed-form mixture likelihood, avoiding the “Average Action Trap” without the generative overhead of diffusion. The GMM head matches or exceeds the diffusion-based head at four of the five budgets; the diffusion head is higher only at the 125% budget (467.2 vs. 461.3). Importantly, the GMM head drastically reduces inference latency, requiring only 11 ms per step compared to 223 ms for the diffusion-based head, roughly a 20-fold reduction. These results confirm that the GMM head provides a superior trade-off between performance and efficiency, making it well-suited for real-time bidding scenarios where low-latency decisions are critical. Overall, we chose the GMM head rather than diffusion to balance modeling parsimony, performance, and industrial feasibility.
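
To illustrate why a mixture head avoids the averaged action while remaining cheap to sample, the following minimal sketch draws candidates from a one-dimensional Gaussian mixture using only the Python standard library. The mixture parameters, the candidate count, and the helper name `sample_gmm` are hypothetical; in DRIVE these parameters would be produced by the Transformer's GMM head.

```python
import random

def sample_gmm(weights, means, stds, n, rng):
    """Draw n scalar samples from a 1-D Gaussian mixture."""
    samples = []
    for _ in range(n):
        # Pick a mixture component by its weight, then sample from its Gaussian.
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples

# Hypothetical two-mode bidding distribution (values are illustrative only).
weights, means, stds = [0.5, 0.5], [1.0, 3.0], [0.01, 0.01]
rng = random.Random(0)
candidates = sample_gmm(weights, means, stds, n=8, rng=rng)

# The mixture mean lies between the two modes, where the density is negligible.
mixture_mean = sum(w * m for w, m in zip(weights, means))
```

Each candidate requires one categorical draw and one Gaussian draw, with no iterative denoising loop. The samples cluster around the two modes, whereas a mean-regression output would sit at 2.0, a bid that neither mode supports.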

Table 4: Effect of GMM vs Diffusion Head.

| Budget (%) | GMM Gen. + Critic (No Retr.) | Diffusion head + Critic |
|---|---|---|
| 50 | 205.4 ± 3.14 | 190.7 ± 2.21 |
| 75 | 296.7 ± 2.39 | 273.2 ± 1.82 |
| 100 | 378.4 ± 2.68 | 376.3 ± 5.72 |
| 125 | 461.3 ± 4.62 | 467.2 ± 2.53 |
| 150 | 532.2 ± 2.16 | 531.4 ± 4.05 |

### 5\.4Quantitative Analysis of Q\-Function Multimodality

To evaluate how frequently Q-functions exhibit multimodality and its effect on the Average Action Trap, we randomly sample 2,000 states from the test set. For each state, we discretize the continuous action space into 100 uniformly spaced points and compute the corresponding Q-values. A state is classified as unimodal if its Q-function has a single peak, or multimodal if it has two or more local peaks. We evaluate the DT policy on these states using two metrics: the suboptimal rate, defined as the proportion of decision steps where the Q-value of the DT output action falls below the 80th percentile of Q-values among the 100 sampled actions, and the distance to the optimal action $a^{*}=\arg\max_{a} Q(s,a)$, measured as the absolute difference between DT’s chosen action and $a^{*}$.
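
The per-state analysis described above can be sketched as follows. The peak definition (a grid point strictly higher than its existing neighbours), the nearest-rank percentile, the action range `[0, 1]`, and the toy Q-function are all assumptions, since the text does not specify these details.

```python
def analyze_state(q_fn, dt_action, low=0.0, high=1.0, n_points=100):
    """Classify one state's Q-function and score the DT action against it."""
    step = (high - low) / (n_points - 1)
    grid = [low + i * step for i in range(n_points)]
    q = [q_fn(a) for a in grid]

    # A grid point counts as a peak if it is strictly higher than its neighbours.
    peaks = 0
    for i in range(n_points):
        left_ok = i == 0 or q[i] > q[i - 1]
        right_ok = i == n_points - 1 or q[i] > q[i + 1]
        peaks += left_ok and right_ok

    threshold = sorted(q)[int(0.8 * n_points)]  # nearest-rank 80th percentile
    a_star = grid[max(range(n_points), key=q.__getitem__)]
    return {
        "multimodal": peaks >= 2,
        "suboptimal": q_fn(dt_action) < threshold,
        "distance": abs(dt_action - a_star),
    }

# Toy bimodal Q-function with peaks near a = 0.2 and a = 0.8 (hypothetical).
def toy_q(a):
    return max(1.0 - ((a - 0.2) / 0.1) ** 2, 0.8 - ((a - 0.8) / 0.1) ** 2)

# A mean-regression policy would output an action between the two modes.
result = analyze_state(toy_q, dt_action=0.5)
```

For this toy state the Q-function is classified as multimodal, and the averaged action 0.5 falls in the low-value valley between the two peaks, so it is counted as suboptimal. Aggregating `suboptimal` and `distance` over sampled states gives the two metrics reported in the analysis.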

Table [5](https://arxiv.org/html/2606.14192#S5.T5) summarizes the overall statistics aggregated across all delivery periods P7–P13. We find that 17.6% of states are multimodal, where the DT mean-regression output falls into suboptimal regions with a suboptimal rate of 54.7%, compared to 38.2% for unimodal states. The average distance to the optimal action increases from 4.24 for unimodal states to 6.10 for multimodal states, confirming that the Average Action Trap is more severe in multimodal regions. This quantitative evidence supports the need for a GMM-based policy to preserve multiple bidding modes and mitigate systematic failure in offline auto-bidding. More detailed per-period visualizations and analysis are provided in Appendix [C.3](https://arxiv.org/html/2606.14192#A3.SS3).

Table 5: Q-function multimodality analysis and DT suboptimality.

| Q Shape | Proportion | Suboptimal Rate (%) | Distance to $a^{*}$ |
|---|---|---|---|
| Unimodal | 82.4% | 38.2 | 4.24 |
| Multimodal | 17.6% | 54.7 | 6.10 |
| Overall | 100% | 41.1 | 4.51 |

![Refer to caption](https://arxiv.org/html/2606.14192v1/x4.png)

Figure 4: Performance across budget constraints for diverse backbone methods.

### 5\.5Generalization Across Diverse Backbones

To evaluate the general applicability of DRIVE, its core components are integrated into three representative Transformer-based policy backbones beyond the vanilla DT, including BC, CDT, and PDiT. As shown in Figure [4](https://arxiv.org/html/2606.14192#S5.F4), this integration consistently improves performance across all backbones. In particular, the PDiT backbone achieves a substantial 19.2% improvement in average total reward. These results indicate that the proposed distributional and retrieval-augmented framework effectively mitigates mode collapse and sampling instability across different generative architectures, independent of the specific backbone design. Detailed numerical results are provided in Appendix [C.4](https://arxiv.org/html/2606.14192#A3.SS4).

Further analyses are provided to validate the robustness of DRIVE under practical constraints. In particular, results in Appendix [C.5](https://arxiv.org/html/2606.14192#A3.SS5) show that the constraint-aware critic is essential for constrained bidding tasks, as removing the penalty term leads to severe violation of CPA limits. Appendix [C.6](https://arxiv.org/html/2606.14192#A3.SS6) compares different value-based critics, while sensitivity analyses in Appendix [C.7](https://arxiv.org/html/2606.14192#A3.SS7) and Appendix [C.8](https://arxiv.org/html/2606.14192#A3.SS8) demonstrate that DRIVE remains stable across a wide range of sampling and retrieval hyperparameters. Despite these performance gains, DRIVE incurs only a modest additional computational overhead at inference time. A detailed analysis of time and cost efficiency is provided in Appendix [C.9](https://arxiv.org/html/2606.14192#A3.SS9).

## 6Conclusion

This paper presents DRIVE, a unified framework for offline auto-bidding that addresses multimodal optimal behaviors and unreliable decision making under sparse data. DRIVE decouples candidate generation from decision-making and mitigates the “Average Action” trap by combining distributional GMM modeling with retrieval-augmented candidate generation and value-based evaluation. Extensive experiments on the industrial AuctionNet benchmark and standard D4RL tasks demonstrate that DRIVE outperforms existing methods on average and transfers well across different Transformer-based backbones, providing a robust paradigm for deploying offline RL in bidding environments.

## Acknowledgements

This work was supported by the National Natural Science Foundation of China \(Grant No\.62506210\)\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- S. Balseiro, Y. Deng, J. Mao, V. Mirrokni, and S. Zuo (2021a) Robust auction design in the auto-bidding world. Advances in Neural Information Processing Systems 34, pp. 17777–17788.
- S. R. Balseiro, Y. Deng, J. Mao, V. S. Mirrokni, and S. Zuo (2021b) The landscape of auto-bidding auctions: value versus utility maximization. In Proceedings of the 22nd ACM Conference on Economics and Computation, pp. 132–133.
- C. M. Bishop (1994) Mixture density networks.
- S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J. Lespiau, B. Damoc, A. Clark, et al. (2022) Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pp. 2206–2240.
- H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo (2017) Real-time bidding by reinforcement learning in display advertising. In Proceedings of the tenth ACM international conference on web search and data mining, pp. 661–670.
- L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. Advances in neural information processing systems 34, pp. 15084–15097.
- Y. Chen, P. Berkhin, B. Anderson, and N. R. Devanur (2011) Real-time bidding algorithms for performance-based display ad allocation. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1307–1315.
- Y. Deng, J. Mao, V. Mirrokni, and S. Zuo (2021) Towards efficient auctions in an auto-bidding world. In Proceedings of the Web Conference 2021, pp. 3965–3973.
- J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020) D4RL: datasets for deep data-driven reinforcement learning. CoRR abs/2004.07219.
- S. Fujimoto and S. S. Gu (2021) A minimalist approach to offline reinforcement learning. Advances in neural information processing systems 34, pp. 20132–20145.
- S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596.
- S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052–2062.
- J. Gao, Y. Li, S. Mao, P. Jiang, N. Jiang, Y. Wang, Q. Cai, F. Pan, P. Jiang, K. Gai, et al. (2025) Generative auto-bidding with value-guided explorations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 244–254.
- J. Guo, Y. Huo, Z. Zhang, T. Wang, C. Yu, J. Xu, B. Zheng, and Y. Zhang (2024) Generative auto-bidding via conditional diffusion modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5038–5049.
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Retrieval augmented language model pre-training. In International conference on machine learning, pp. 3929–3938.
- Y. He, X. Chen, D. Wu, J. Pan, Q. Tan, C. Yu, J. Xu, and X. Zhu (2021) A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2993–3001.
- M. Janner, Q. Li, and S. Levine (2021) Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems 34, pp. 1273–1286.
- C. Jiang, M. Jiang, Q. Xu, and X. Huang (2017) Expectile regression neural network model with applications. Neurocomputing 247, pp. 73–86.
- J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7(3), pp. 535–547.
- J. Kang, R. Laroche, X. Yuan, A. Trischler, X. Liu, and J. Fu (2024) Think before you act: decision transformers with working memory. In Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (Eds.), Proceedings of Machine Learning Research, Vol. 235, pp. 23001–23021.
- I. Kostrikov, A. Nair, and S. Levine (2022) Offline reinforcement learning with implicit q-learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
- A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 11761–11771.
- A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems 33, pp. 1179–1191.
- K. Lee, A. Jalali, and A. Dasdan (2013) Real time bid optimization with smooth budget delivery in online advertising. In Proceedings of the seventh international workshop on data mining for online advertising, pp. 1–9.
- S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, pp. 9459–9474.
- Y. Li, S. Mao, J. Gao, N. Jiang, Y. Xu, Q. Cai, F. Pan, P. Jiang, and B. An (2025) GAS: generative auto-bidding with post-training search. In Companion Proceedings of the ACM on Web Conference 2025, pp. 315–324.
- M. Liu, L. Jiaxing, Z. Hu, J. Liu, and X. Nie (2020) A dynamic bidding strategy based on model-free reinforcement learning in display advertising. IEEE Access 8, pp. 213587–213601.
- Z. Liu, Z. Guo, Y. Yao, Z. Cen, W. Yu, T. Zhang, and D. Zhao (2023) Constrained decision transformer for offline safe reinforcement learning. In International conference on machine learning, pp. 21611–21630.
- B. Lucier, R. Paes Leme, and É. Tardos (2012) On revenue in the generalized second price auction. In Proceedings of the 21st international conference on World Wide Web, pp. 361–370.
- H. Mao, R. Zhao, Z. Li, Z. Xu, H. Chen, Y. Chen, B. Zhang, Z. Xiao, J. Zhang, and J. Yin (2024) PDiT: interleaving perception and decision-making transformers for deep reinforcement learning. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 1363–1371.
- W. Ou, B. Chen, Y. Yang, X. Dai, W. Liu, W. Zhang, R. Tang, and Y. Yu (2023) Deep landscape forecasting in multi-slot real-time bidding. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4685–4695.
- C. Perlich, B. Dalessandro, R. Hook, O. Stitelman, T. Raeder, and F. Provost (2012) Bid optimizing and inventory scoring in targeted online advertising. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 804–812.
- M. L. Puterman (1990) Markov decision processes. Handbooks in operations research and management science 2, pp. 331–434.
- D. A. Reynolds (2018) Gaussian mixture models. In Encyclopedia of Biometrics.
- T. Schmied, F. Paischer, V. Patil, M. Hofmarcher, R. Pascanu, and S. Hochreiter (2024) Retrieval-augmented decision transformer: external memory for in-context rl. arXiv preprint arXiv:2410.07071.
- K. Su, Y. Huo, Z. Zhang, S. Dou, C. Yu, J. Xu, Z. Lu, and B. Zheng (2024) Auctionnet: a novel benchmark for decision-making in large-scale games. Advances in Neural Information Processing Systems 37, pp. 94428–94452.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
- J. Wang and S. Yuan (2015) Real-time bidding: a new frontier of computational advertising research. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 415–416.
- Z. Wang and A. C. Bovik (2009) Mean squared error: love it or leave it? a new look at signal fidelity measures. IEEE signal processing magazine 26(1), pp. 98–117.
- D\. Wu, X\. Chen, X\. Yang, H\. Wang, Q\. Tan, X\. Zhang, J\. Xu, and K\. Gai \(2018\)Budget constrained bidding by model\-free reinforcement learning in display advertising\.InProceedings of the 27th ACM International Conference on Information and Knowledge Management,pp\. 1443–1451\.Cited by:[§1](https://arxiv.org/html/2606.14192#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.14192#S2.SS1.p2.1)\.
- P\. Xia, L\. Zhang, and F\. Li \(2015\)Learning similarity with cosine similarity ensemble\.Information sciences307,pp\. 39–52\.Cited by:[§4\.2](https://arxiv.org/html/2606.14192#S4.SS2.p4.4)\.
- J\. Xu, K\. Lee, W\. Li, H\. Qi, and Q\. Lu \(2015\)Smart pacing for effective online ad campaign optimization\.InProceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining,pp\. 2217–2226\.Cited by:[§2\.1](https://arxiv.org/html/2606.14192#S2.SS1.p1.1)\.
- Z\. Xu, M\. Cui, D\. Li, Z\. Liu, H\. Zhang, H\. Mao, G\. Fan, and B\. Zhang \(2026\)Peak\-return greedy slicing: subtrajectory selection for transformer\-based offline RL\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2\.2](https://arxiv.org/html/2606.14192#S2.SS2.p2.1)\.
- X\. Yang, Y\. Li, H\. Wang, D\. Wu, Q\. Tan, J\. Xu, and K\. Gai \(2019\)Bid optimization by multivariable control in display advertising\.InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining,pp\. 1966–1974\.Cited by:[§2\.1](https://arxiv.org/html/2606.14192#S2.SS1.p1.1)\.
- S\. Yuan, J\. Wang, and X\. Zhao \(2013\)Real\-time bidding for online advertising: measurement and analysis\.InProceedings of the seventh international workshop on data mining for online advertising,pp\. 1–8\.Cited by:[§1](https://arxiv.org/html/2606.14192#S1.p1.1)\.
- W\. Zhang, S\. Yuan, and J\. Wang \(2014\)Optimal real\-time bidding for display advertising\.InProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining,pp\. 1077–1086\.Cited by:[§2\.1](https://arxiv.org/html/2606.14192#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.14192#S3.SS1.p1.11)\.

## Appendix A Implementation Details

In this section, we provide a comprehensive account of the implementation details for the DRIVE framework to ensure reproducibility\. We begin by summarizing the mathematical notation used throughout the paper\. Subsequently, we present the complete algorithmic workflow, detailing the coordination between the GMM policy, retrieval mechanism, and offline critic\. Finally, we specify the exact network architectures, retrieval engineering choices, and hyperparameter configurations used for both the AuctionNet and D4RL benchmarks\.

### A.1 Table of Notation

Table 6: Table of Notation.

| Symbol | Explanation |
|---|---|
| $N$ | Total number of impression opportunities in a period. |
| $v_i, c_i$ | Value (e.g., conversion) and cost of the $i$-th impression. |
| $B, \mathcal{K}_j$ | Total budget and target threshold for the $j$-th KPI. |
| $s_t, a_t, r_t$ | State, action, and reward at timestep $t$. |
| $\hat{R}_t$ | Return-to-go (RTG) conditioned at timestep $t$. |
| $\mathcal{D}$ | Offline dataset of trajectories $\tau$. |
| $\tau_{0:t-1}$ | Trajectory prefix from timestep $0$ to $t-1$. |
| $M$ | Number of mixture components in the GMM head. |
| $\alpha_m, \mu_m, \sigma_m^2$ | Mixing coefficient, mean, and variance of the $m$-th component. |
| $\mathcal{N}(\cdot \mid \mu_m, \sigma_m^2)$ | $m$-th Gaussian density in the mixture. |
| $\mathcal{L}_{\mathrm{GMM}}$ | GMM log-likelihood training objective. |
| $L$ | Number of candidate actions sampled from the GMM action distribution. |
| $\mathcal{A}_{\text{gen}}$ | Set of candidate actions generated from the GMM policy. |
| $\mathcal{I}$ | Retrieval index built from offline embeddings. |
| $h_t$ | Contextual state embedding at timestep $t$. |
| $f_{\text{enc}}$ | Transformer encoder that produces $h_t$. |
| $d$ | Dimensionality of contextual state embeddings $h_t$. |
| $K_{\text{pool}}$ | Size of the initial retrieved candidate pool. |
| $\mathcal{C}_{\text{pool}}$ | Retrieved pool with actions and RTG statistics. |
| $K$ | Number of retrieved actions kept after RTG-based filtering. |
| $\mathcal{A}_{\text{ret}}$ | Set of retrieved candidate actions. |
| $\mathcal{A}_{\text{cand}}$ | Unified candidate pool $\mathcal{A}_{\text{gen}} \cup \mathcal{A}_{\text{ret}}$. |
| $Q_i(s,a)$ | $i$-th Q-function used in the critic (IQL). |
| $V(s)$ | State value function in IQL. |
| $\mathcal{L}_V$ | Expectile regression loss for learning $V(s)$. |
| $\mathcal{L}_Q$ | Bellman regression loss for learning $Q(s,a)$. |
| $L_2^{\eta}(\cdot)$ | Asymmetric squared loss for expectile regression. |
| $\eta$ | Expectile level controlling implicit maximization ($0.5 < \eta < 1$). |
| $\gamma$ | Discount factor in value learning. |
| $a^*$ | Final selected action at inference time. |

### A.2 Algorithm Description

The complete workflow of our proposed framework, DRIVE, is summarized in Algorithm [1](https://arxiv.org/html/2606.14192#alg1). The procedure has three offline phases, all using the offline dataset $\mathcal{D}$: training the GMM-based policy $\pi_\theta$, constructing a retrieval index $\mathcal{I}$ from representative state embeddings, and training the value-based critic $Q_\phi$. At inference time, the process integrates generative sampling, RTG-guided retrieval, and conservative value evaluation to select the final action $a^*$.
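As a minimal sketch of the policy training objective, the snippet below evaluates the negative log-likelihood of a scalar action under a one-dimensional Gaussian mixture with mixing coefficients $\alpha_m$, means $\mu_m$, and standard deviations $\sigma_m$. It assumes the standard mixture likelihood; the exact form of $\mathcal{L}_{\text{GMM}}$ is the one given in the main text, and the function name `gmm_nll` is illustrative only.

```python
import math


def gmm_nll(action, alphas, mus, sigmas):
    """Negative log-likelihood of a scalar action under a 1-D Gaussian mixture.

    Assumption: standard mixture density sum_m alpha_m * N(action | mu_m, sigma_m^2).
    """
    log_terms = []
    for alpha, mu, sigma in zip(alphas, mus, sigmas):
        # log N(action | mu, sigma^2)
        log_pdf = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                   - (action - mu) ** 2 / (2.0 * sigma ** 2))
        log_terms.append(math.log(alpha) + log_pdf)
    # log-sum-exp over components for numerical stability
    top = max(log_terms)
    log_mix = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return -log_mix


# Toy bimodal mixture: an action at a mode is far more likely than one between modes.
at_mode = gmm_nll(1.0, [0.5, 0.5], [1.0, 5.0], [0.5, 0.5])
between_modes = gmm_nll(3.0, [0.5, 0.5], [1.0, 5.0], [0.5, 0.5])
```

In this toy mixture, the midpoint between the two modes receives a much higher loss than either mode, which is the property that lets a mixture head avoid averaged actions.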

Algorithm 1: The DRIVE Framework

Require: Offline dataset $\mathcal{D}$; number of GMM components $M$; target return $\hat{R}_t$.

Require: Hyperparameters $L$ (generated samples), $K_{\text{pool}}$ (similarity pool size), $K$ (final retrieved actions).

1. // Phase 1: Policy Training
2. Initialize policy $\pi_\theta$
3. while policy not converged do
4. Sample batch of trajectories $\tau \sim \mathcal{D}$
5. Update policy $\pi_\theta$ by minimizing $\mathcal{L}_{\text{GMM}}$ (Eq. [6](https://arxiv.org/html/2606.14192#S4.E6))
6. end while
7. // Phase 2: Retrieval Index Construction
8. Initialize $\mathcal{I} \leftarrow \emptyset$
9. for $\tau \in \mathcal{D}$ do
10. Compute state embeddings $h_t$ (Eq. [8](https://arxiv.org/html/2606.14192#S4.E8))
11. Store $(h_t \to \{a_t, \hat{R}_t\})$ into $\mathcal{I}$
12. end for
13. // Phase 3: Critic Training
14. Initialize critic $Q_\phi, V_\psi$
15. while critic not converged do
16. Sample transitions $(s, a, r, s') \sim \mathcal{D}$
17. Update $V_\psi$ via expectile regression (Eq. [11](https://arxiv.org/html/2606.14192#S4.E11))
18. Update $Q_\phi$ via Bellman error (Eq. [12](https://arxiv.org/html/2606.14192#S4.E12))
19. end while
20. Evaluation:
21. for each decision step $t$ do
22. Encoding: Get $h_t$ and GMM params via $\pi_\theta(\tau_{0:t-1}, \hat{R}_t, s_t)$
23. Generation: $\mathcal{A}_{\text{gen}} \leftarrow \{a^{(l)} \sim \sum_{m=1}^{M} \alpha_m \mathcal{N}(\mu_m, \sigma_m^2)\}_{l=1}^{L}$
24. Retrieval:
25. Query neighbors: $\mathcal{C}_{\text{pool}} \leftarrow \text{Top-}K_{\text{pool}}^{\text{sim}}(\mathcal{I}, h_t)$
26. Filter by RTG: $\mathcal{A}_{\text{ret}} \leftarrow \text{Top-}K^{\hat{R}}(\mathcal{C}_{\text{pool}})$
27. Execution Phase:
28. Pool: $\mathcal{A}_{\text{cand}} \leftarrow \mathcal{A}_{\text{gen}} \cup \mathcal{A}_{\text{ret}}$
29. Select: optimal action $a^*$ (Eq. [14](https://arxiv.org/html/2606.14192#S4.E14))
30. Execute $a^*$
31. end for
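The inference loop of Algorithm 1 (generation, pooling, selection) can be sketched in a few lines. This is an illustrative toy with scalar actions: `sample_gmm`, `select_action`, and the quadratic `toy_q` critic are hypothetical names, and selection is assumed to be a plain argmax of a single Q-function over the pooled candidates; the exact selection rule is the one defined in Eq. 14 of the main text.

```python
import random


def sample_gmm(alphas, mus, sigmas, num_samples, rng):
    """Draw scalar candidate actions from a 1-D Gaussian mixture."""
    samples = []
    for _ in range(num_samples):
        # pick a component according to the mixing coefficients, then sample from it
        m = rng.choices(range(len(alphas)), weights=alphas, k=1)[0]
        samples.append(rng.gauss(mus[m], sigmas[m]))
    return samples


def select_action(gen_actions, ret_actions, q_fn, state):
    """Pool generated and retrieved candidates and return the highest-valued one."""
    candidates = list(gen_actions) + list(ret_actions)
    return max(candidates, key=lambda a: q_fn(state, a))


def toy_q(state, action):
    # hypothetical critic that peaks at action = 4.0 (stand-in for a trained Q-network)
    return -(action - 4.0) ** 2


rng = random.Random(0)
gen = sample_gmm([0.5, 0.5], [1.0, 5.0], [0.3, 0.3], num_samples=8, rng=rng)
ret = [3.9, 0.8]  # hypothetical retrieved actions
best = select_action(gen, ret, toy_q, state=None)
```

Because the critic only ranks candidates, the generator and the retriever can be swapped or tuned independently, which is the decoupling the framework relies on.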

### A.3 Retrieval Implementation Details

We facilitate efficient and effective retrieval through a rigorous filtering strategy and a high\-performance approximate nearest neighbor search implementation\.

Index Construction and Filtering. To ensure the retrieval mechanism provides high-quality guidance, we implement a lightweight filtering step during index construction. Specifically, we exclude transitions associated with a zero Return-to-Go, i.e., $\hat{R}_t = 0$. These instances typically correspond to unsuccessful bids or impressions that yielded no value, rendering them uninformative for policy improvement. Filtering these samples not only improves the relevance of retrieved candidates but also reduces the memory footprint of the index.

Scalable Implementation. To ensure scalability and low-latency inference, particularly for large-scale trajectory datasets, we implement the retrieval mechanism using the FAISS library (Johnson et al., [2019](https://arxiv.org/html/2606.14192#bib.bib49)). Given the contextual embeddings $h_t$ produced by the Transformer encoder, we construct the retrieval index using the Hierarchical Navigable Small World (HNSW) algorithm, with the inner product as the similarity metric. Since all embeddings are pre-normalized, this is equivalent to cosine similarity search. The HNSW graph structure allows approximately logarithmic-time queries, offering a favorable trade-off between search speed and retrieval recall relative to exact search methods. To further enhance robustness during inference, we employ an oversampling strategy in which the $3 \times K$ nearest neighbors are initially retrieved based on cosine similarity. From this expanded pool, we filter out invalid entries and select the final $K$ candidates with the highest stored RTG values.
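The retrieval logic described above can be illustrated without any external dependency. The sketch below deliberately swaps the FAISS HNSW index for an exact brute-force inner-product search, so it shows only the data flow: drop zero-RTG entries, normalize embeddings, oversample $3 \times K$ neighbors by cosine similarity, then keep the $K$ entries with the highest stored RTG. Treating "invalid entries" as those with a non-finite RTG is an assumption, and all function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_index(embeddings, actions, rtgs):
    """Store (unit embedding, action, RTG) triples, skipping zero-RTG transitions."""
    return [
        (normalize(h), a, r)
        for h, a, r in zip(embeddings, actions, rtgs)
        if r != 0
    ]


def retrieve(index, query, k, oversample=3):
    """Exact top-(oversample * k) similarity search followed by top-k RTG filtering."""
    q = normalize(query)
    by_similarity = sorted(
        index,
        key=lambda entry: sum(x * y for x, y in zip(entry[0], q)),
        reverse=True,
    )
    pool = by_similarity[: oversample * k]
    # assumption: an entry is "invalid" if its stored RTG is not a finite number
    valid = [entry for entry in pool if math.isfinite(entry[2])]
    top = sorted(valid, key=lambda entry: entry[2], reverse=True)[:k]
    return [(action, rtg) for _, action, rtg in top]


# Toy data: action 20 has zero RTG and is never indexed.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [1.0, 0.05]]
actions = [10, 20, 30, 40, 50]
rtgs = [5, 0, 9, 100, 7]
index = build_index(embeddings, actions, rtgs)
result = retrieve(index, [1.0, 0.0], k=1)
```

In the toy data, the entry with the largest RTG overall (action 40) is not returned because it is dissimilar to the query; RTG ranking is applied only inside the similarity pool.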

### A.4 Hyperparameter Settings

We summarize the comprehensive hyperparameter settings and architectural details for all components in Table[7](https://arxiv.org/html/2606.14192#A1.T7)\.

AuctionNet Configuration. We adopt a dual-model strategy for the AuctionNet environment to balance decision-making capacity with retrieval efficiency. The main policy network is configured with a high-capacity backbone to capture the complex, multimodal decision boundaries inherent in auto-bidding. In contrast, the auxiliary encoder is designed as a lightweight architecture with a reduced embedding dimension (64 versus 512 for the policy). This compression is critical for minimizing the storage footprint of the vector index and accelerating similarity search. While the main network uses a sliding context window of length $K = 20$ for efficient inference, we set the context length of the retrieval encoder to 48, matching the full episode length of the AuctionNet environment. This design allows the encoder to attend to the complete history of an advertising period, generating a global and comprehensive trajectory representation for effective indexing.

D4RL Configuration. For the D4RL benchmarks, we strictly adhere to the standard Decision Transformer architecture to ensure a fair comparison with baseline methods. Unlike the AuctionNet setup, the latent space in Gym tasks is sufficiently compact for direct vector search. Consequently, as noted in the table, we do not employ a separate retrieval encoder for these tasks. Instead, we directly utilize the contextualized representations output by the policy backbone as query vectors.

Table 7: Hyperparameter and Architecture Comparison. Summary of the model configurations for the main AuctionNet policy, the auxiliary encoder used for retrieval, and the model used for Gym locomotion tasks.

| Configuration | AuctionNet (Policy) | AuctionNet (Encoder) | D4RL (No Separate Encoder) |
|---|---|---|---|
| Activation Function | ReLU | ReLU | GELU |
| Embedding Dim ($d$) | 512 | 64 | 128 |
| Number of Layers | 6 | 3 | 3 |
| Attention Heads | 8 | 4 | 1 |
| Dropout | 0.1 | 0.1 | 0.1 |
| Context Length ($K$) | 20 | 48 | 20 |
| Max Episode Length | 48 | 48 | 1000 |
| Optimizer | AdamW | AdamW | AdamW |
| Learning Rate | $1\times 10^{-5}$ | $1\times 10^{-3}$ | $1\times 10^{-4}$ |
| Weight Decay | $1\times 10^{-4}$ | $1\times 10^{-3}$ | $1\times 10^{-4}$ |
| Warmup Steps | 10,000 | 10,000 | 10,000 |
| Batch Size | 256 | 256 | 64 |

Critic Implementation Details. We detail the specific hyperparameter configurations and architectural choices for the IQL critic in Table [8](https://arxiv.org/html/2606.14192#A1.T8). Unlike the standard wide MLP architectures typically employed in D4RL continuous control benchmarks, we adopt a more compact network structure optimized for the AuctionNet environment. Specifically, the V-network uses a tapered structure (decreasing hidden sizes) to efficiently process the auction state features. The optimization hyperparameters, including learning rates and soft update rates, were also calibrated to ensure training stability and convergence in this domain.

Table 8: Critic Hyperparameter Comparison between AuctionNet and D4RL Benchmarks. Summary of architecture and optimization details for IQL (Critic) modules.

| Hyperparameter | AuctionNet (Ours) | D4RL Benchmarks |
|---|---|---|
| Q-Network Hidden Sizes | [64, 64] | [256, 256] |
| V-Network Hidden Sizes | [128, 64, 32] | [256, 256] |
| Activation Function | ReLU | ReLU |
| Optimizer | Adam | Adam |
| Critic Learning Rate | $1\times 10^{-4}$ | $3\times 10^{-4}$ |
| Soft Update Rate ($\tau$) | 0.01 | 0.005 |
| Expectile ($\eta$) | 0.7 | 0.7 |
| Discount Factor ($\gamma$) | 0.99 | 0.99 |
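To make the roles of the expectile $\eta$ and the soft update rate $\tau$ in Table 8 concrete, the following sketch assumes the standard IQL asymmetric squared loss, $L_2^{\eta}(u) = \lvert \eta - \mathbb{1}(u < 0) \rvert\, u^2$ with $u = Q(s,a) - V(s)$, and a Polyak-style target update. Both forms are assumptions about the usual IQL recipe rather than a transcription of the authors' code, and the function names are illustrative.

```python
def expectile_loss(diff, eta=0.7):
    """Asymmetric squared loss |eta - 1(diff < 0)| * diff^2, with diff = Q(s, a) - V(s)."""
    # positive residuals (V below Q) are weighted by eta, negative ones by 1 - eta
    weight = eta if diff >= 0 else 1.0 - eta
    return weight * diff * diff


def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


under = expectile_loss(1.0)    # V underestimates Q: weight eta = 0.7
over = expectile_loss(-1.0)    # V overestimates Q: weight 1 - eta = 0.3
new_target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.01)
```

With $\eta = 0.7$, underestimating $Q$ costs more than twice as much as overestimating it, which pushes $V(s)$ toward the upper range of in-dataset action values without querying out-of-distribution actions.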

## Appendix B Experiment Details

In this section, we detail the experimental setups used to evaluate the proposed DRIVE framework. We primarily utilize the industrial AuctionNet dataset to assess performance in realistic auto-bidding scenarios, with a specific focus on the challenges posed by sparse rewards and long-tail action distributions. Additionally, to demonstrate the method’s applicability beyond bidding, we extend our evaluation to the standard D4RL continuous control benchmarks. The specifications, state features, and distributional characteristics of these environments are described below.

### B.1 Dataset Details

We primarily evaluate DRIVE on the AuctionNet dataset for auction bidding. To assess its generalization capabilities, we further extend our evaluation to the D4RL benchmark for continuous control. Visualizations of these tasks are provided in Figure [5](https://arxiv.org/html/2606.14192#A2.F5).

![Refer to caption](https://arxiv.org/html/2606.14192v1/x5.png)

Figure 5: Visualizations of the AuctionNet and D4RL benchmarks.

AuctionNet. AuctionNet has two versions, with AuctionNet-Sparse characterized by sparse rewards. Each dataset is organized into multiple delivery periods. Every period encompasses approximately 500,000 impression opportunities and is temporally segmented into 48 decision steps. Detailed parameters are shown in Table [9](https://arxiv.org/html/2606.14192#A2.T9).

Table 9: The Parameters of AuctionNet and AuctionNet-Sparse.

| Params | AuctionNet | AuctionNet-Sparse |
|---|---|---|
| Trajectories | 479,376 | 479,376 |
| Delivery Periods | 9,987 | 9,987 |
| Time steps in a trajectory | 48 | 48 |
| State dimension | 16 | 16 |
| Action dimension | 1 | 1 |
| Return-To-Go dimension | 1 | 1 |
| Action range | [0, 493] | [0, 589] |
| Impression’s value range | [0, 1] | [0, 1] |
| CPA range | [6, 12] | [60, 130] |
| Total conversion range | [0, 1512] | [0, 57] |

The trajectory-formatted data is aggregated from the raw traffic logs and captures the decision-making process by recording the information for multiple advertisers at every step. The 16 state features are listed below:

- time\_left: Represents the remaining time in the current delivery period.
- budget\_left: Indicates the advertiser’s remaining budget available for the current period.
- historical\_bid\_mean: The average bid price placed by the advertiser across all preceding time steps.
- last\_three\_bid\_mean: The moving average of the advertiser’s bid prices over the most recent three time steps.
- historical\_LeastWinningCost\_mean: The historical average of the market price (minimum cost to win an impression) observed in previous steps.
- historical\_pValues\_mean: The average conversion probability (p-value) of impressions in the past time steps.
- historical\_conversion\_mean: The average number of conversion events achieved by the advertiser in prior steps.
- historical\_xi\_mean: The historical winning rate, calculated as the average binary winning status, where 1 represents winning the impression and 0 represents not winning.
- last\_three\_LeastWinningCost\_mean: The average of the least winning costs over the last three time steps.
- last\_three\_pValues\_mean: The average conversion probability of impressions over the last three time steps.
- last\_three\_conversion\_mean: The average number of conversions obtained during the last three time steps.
- last\_three\_xi\_mean: The recent winning rate, representing the average winning status over the last three time steps.
- current\_pValues\_mean: The average conversion probability of all impression opportunities in the current time step.
- current\_pv\_num: The total volume of impression opportunities available at the current time step.
- last\_three\_pv\_num\_total: The cumulative number of impression opportunities served over the last three time steps.
- historical\_pv\_num\_total: The total accumulated count of impression opportunities over past time steps.
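As an illustration of how the historical and recent-window features in the list above relate, the sketch below derives the two bid features from a list of per-step bids. It is not the benchmark's own preprocessing code: the helper names and the convention of returning 0.0 when no history exists are assumptions.

```python
def mean(values):
    """Arithmetic mean; returns 0.0 for an empty history (assumed convention)."""
    return sum(values) / len(values) if values else 0.0


def bid_features(past_bids):
    """Compute the two bid-related state features from the bids of preceding steps."""
    return {
        "historical_bid_mean": mean(past_bids),        # all preceding time steps
        "last_three_bid_mean": mean(past_bids[-3:]),   # most recent three time steps
    }


features = bid_features([2.0, 4.0, 6.0, 8.0])
first_step = bid_features([])
```

The other `historical_*` and `last_three_*` features follow the same pattern, applied to market prices, conversion probabilities, conversions, and win indicators.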

The experiments are performed in a simulation environment resembling a real-world commercial advertising system (Su et al., [2024](https://arxiv.org/html/2606.14192#bib.bib36)). An episode represents one delivery day, segmented into 48 time steps, with a traffic volume of approximately 500,000 impressions. The competition landscape consists of 48 advertisers with varying budgets and CPA constraints. In the evaluation phase, our model controls a target advertiser. To evaluate robustness, we conduct repeated trials using different advertiser profiles and delivery periods, taking the average value as the final evaluation score.

To evaluate the complexity shift between the dense and sparse settings, we visualize the action distributions of both the original AuctionNet and the AuctionNet-Sparse datasets. As illustrated in Figure [6](https://arxiv.org/html/2606.14192#A2.F6), the comparison reveals fundamental structural differences that underscore the difficulty of the sparse control task.

![Refer to caption](https://arxiv.org/html/2606.14192v1/x6.png)

Figure 6: Comparison of Action Distributions: AuctionNet vs. AuctionNet-Sparse. The top row shows the original dense dataset, while the bottom row depicts the sparse variant. The log-scale plots (right column) reveal the extreme long-tail property of the sparse dataset, where the action space extends significantly with a drastic increase in kurtosis. This shift highlights the exploration difficulty in the sparse setting.

While the dense dataset concentrates actions within a relatively compact range, the sparse variant exhibits a significantly broader action coverage. The linear-scale plots (left column) demonstrate that although the majority of bidding actions remain in the lower value region for both datasets, the sparse setting requires the agent to generalize over a much wider and sparser manifold. The logarithmic-scale plots (right column) highlight the heavy-tailed nature of the distributions, with the sparse dataset showing a more pronounced long tail than the dense one. This structural shift implies that the agent must handle a higher degree of distributional shift and learn to retrieve valid high-value actions that are statistically rare but critical for optimal performance. These observations motivate generative models capable of representing multi-modal distributions and handling extreme outliers, rather than simple unimodal regression objectives.

D4RL. D4RL (Datasets for Deep Data-Driven Reinforcement Learning) serves as a standardized benchmark suite for offline reinforcement learning. In the domain of high-dimensional continuous control, the Gym-MuJoCo tasks simulate articulated robots with varying degrees of freedom: hopper (a monoped), walker2d (a planar biped), and halfcheetah (a two-legged cheetah-like robot). The objective in these environments is to master stable locomotion behaviors to maximize forward velocity while minimizing control costs. Complementing these, the Maze2D domain focuses on sparse-reward navigation, requiring a 2D agent to traverse complex geometric layouts to reach specified target coordinates. The benchmark provides datasets of diverse qualities to test the algorithm’s generalization.

Applicability of GMM to General Offline RL. Although DRIVE is motivated by auto-bidding, the challenges it addresses, namely multimodal action distributions and data heterogeneity, are ubiquitous in general offline RL benchmarks. In Gym-MuJoCo tasks, particularly in suboptimal datasets like medium-replay, the data consists of a mixture of high-performing expert behaviors and low-quality exploratory noise. A unimodal policy risks averaging these conflicting behaviors into a mediocre action. DRIVE’s distributional modeling, combined with critic-guided selection, allows the agent to disentangle optimal behaviors from noisy data, effectively recovering high-reward actions from mixed-quality distributions. In Maze2D tasks, the optimal policy is inherently multimodal, for example when an obstacle can be bypassed from either the left or the right. Standard deterministic regressors tend to output invalid interpolated actions that drive the agent into the wall, whereas our GMM head explicitly captures these distinct valid trajectories.

## Appendix C Additional Results

### C.1 Visualization of Multi-Modal Action Distributions

To provide a concrete intuition for the limitations of deterministic policies in offline RL, we select and visualize specific instances where the retrieved neighborhood exhibits complex, multi-modal structures. Figure [7](https://arxiv.org/html/2606.14192#A3.F7) presents six representative examples identified from the evaluation set. These cases were chosen to explicitly illustrate the “Average Action” Trap, where the distribution of candidate actions contains distinct modes rather than a single unimodal cluster.

![Refer to caption](https://arxiv.org/html/2606.14192v1/x7.png)

Figure 7: Representative retrieval neighborhoods illustrating the discrepancy between the DT prediction and the optimal action. Each panel shows the empirical distribution of actions, the mean action predicted by the DT head (blue vertical line, “DT (Mean)”), and the action that achieves the highest return in that neighborhood (red vertical line, “Optimal”). “Gap” denotes the absolute difference between these two actions.

In these visualizations, the gray curves represent the empirical distribution of actions found in the retrieved history. The blue vertical lines denote the expected action predicted by a standard Decision Transformer head (trained via MSE), which mathematically converges to the conditional mean of the distribution. Crucially, we observe that the return-maximizing actions (red vertical lines) consistently reside within the high-density modes. We define the *gap* as

$$\text{Gap} = \lvert a_{\text{DT mean}} - a_{\text{optimal}} \rvert.$$

The significant Gap between the mean prediction and the optimal action highlights the failure mode of deterministic heads: by averaging over conflicting strategies, they produce actions that fall into regions corresponding to suboptimal behaviors.

This qualitative evidence underscores the necessity of the GMM policy head used in DRIVE, which is designed to capture these distinct modes explicitly rather than collapsing them into a single mean\.
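The gap defined above can be reproduced on a toy neighborhood. The sketch below uses made-up actions and returns (not data from the paper) and relies on the fact that the minimizer of mean squared error over a set of targets is their mean; `dt_mean_gap` is a hypothetical helper name.

```python
def dt_mean_gap(actions, returns):
    """Return (mean action, return-maximizing action, absolute gap between them)."""
    # an MSE-trained head converges to the mean of the actions it regresses onto
    mean_action = sum(actions) / len(actions)
    # the "optimal" action is the one paired with the highest return in the neighborhood
    best_action = max(zip(actions, returns), key=lambda pair: pair[1])[0]
    return mean_action, best_action, abs(mean_action - best_action)


# Toy bimodal neighborhood: one cluster of low bids, one cluster of high bids.
actions = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
returns = [3.0, 4.0, 2.0, 9.0, 10.0, 8.0]
mean_action, best_action, gap = dt_mean_gap(actions, returns)
```

Here the mean action lies near 3.0, a region where no logged action exists, while the return-maximizing action sits inside the high-bid mode.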

![Refer to caption](https://arxiv.org/html/2606.14192v1/x8.png)

Figure 8: Qualitative comparison of trajectories on Maze2D. The top row displays results on maze2d-medium, and the bottom row on maze2d-large. Green solid lines denote DRIVE (Ours), and red dashed lines denote the DT baseline. DRIVE consistently generates collision-free paths to the goal without manual seed selection.

### C.2 Visualization on Maze2D

In addition to the quantitative evaluations presented in the main text, Figure [8](https://arxiv.org/html/2606.14192#A3.F8) presents a qualitative analysis of the learned policies via rollout trajectories on the maze2d-medium and maze2d-large tasks. To ensure an unbiased comparison, manual curation of successful cases is strictly avoided. Instead, four random seeds are uniformly sampled for each environment variant. For each seed, the environment is initialized with fixed start and goal positions, followed by the execution of a single evaluation episode for both the DRIVE agent and the DT baseline. All eight resulting trajectory pairs are visualized directly without selection. As illustrated, the DRIVE agent consistently plans shorter and more stable paths to navigate around obstacles, successfully reaching the goal in all sampled scenarios. In contrast, the DT baseline frequently fails to find feasible paths, resulting in collisions with walls or stagnation in suboptimal regions. These visualizations qualitatively corroborate the performance improvements and long-horizon planning capabilities of DRIVE observed in the quantitative results.

![Refer to caption](https://arxiv.org/html/2606.14192v1/x9.png)

Figure 9: Q-function multimodality analysis across delivery periods P7–P13 (2,000 randomly sampled states). (a) Proportion of states with multimodal Q-functions ($\geq 2$ peaks). (b) DT suboptimal rate (Q-percentile $< 80\%$) for multimodal vs. unimodal states. (c) Average DT distance to $a^*$ for multimodal vs. unimodal states.

### C.3 Q-Function Multimodality Analysis

Figure [9](https://arxiv.org/html/2606.14192#A3.F9) breaks down Q-function multimodality by delivery period (P7–P13), reporting multimodal prevalence, DT suboptimal rates, and distances to the optimal action. Overall, multimodal states consistently constitute a notable fraction of the state space and are systematically more challenging: their DT suboptimal rates are higher, and the average distance to the optimal action is larger compared to unimodal states. These trends persist across all delivery periods, illustrating that the severity of the Average Action Trap is generally consistent over time. This analysis further motivates the use of a GMM-based policy to preserve multiple bidding modes and mitigate systematic failures in offline auto-bidding.
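The two per-state quantities behind this analysis can be sketched as follows. The peak-detection rule and the percentile definition used by the authors are not specified in this appendix, so both are assumptions here: a peak is taken to be a strict interior local maximum of $Q(s, \cdot)$ on an ordered action grid, and the Q-percentile of an action is the share of grid actions whose Q-value does not exceed it. Function names are hypothetical.

```python
def count_peaks(q_values):
    """Count strict interior local maxima of Q over an ordered action grid."""
    return sum(
        1
        for i in range(1, len(q_values) - 1)
        if q_values[i] > q_values[i - 1] and q_values[i] > q_values[i + 1]
    )


def q_percentile(q_values, q_of_action):
    """Percentage of grid actions whose Q-value is at most that of the chosen action."""
    return 100.0 * sum(q <= q_of_action for q in q_values) / len(q_values)


# Toy Q-profile over eight grid actions with two peaks (values 2 and 5).
q_grid = [0.0, 2.0, 1.0, 0.0, 3.0, 5.0, 3.0, 0.0]
num_peaks = count_peaks(q_grid)                 # state counts as multimodal if >= 2
dt_percentile = q_percentile(q_grid, 2.0)       # action sitting on the lower peak
is_suboptimal = dt_percentile < 80.0
```

Under these assumed definitions, an action that lands on the lower of two peaks is flagged as suboptimal even though it is a local maximum of the Q-function.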

### C.4 Generalization across DT-style Architectures

While our primary analysis centers on the standard Decision Transformer, we contend that the “Average Action” issue is a systemic limitation inherent to the entire family of sequence modeling policies. To validate the generalization capabilities of DRIVE across this broader class, we incorporated it into three representative Transformer-based baselines: the original DT (Chen et al., [2021](https://arxiv.org/html/2606.14192#bib.bib23)), CDT (Liu et al., [2023](https://arxiv.org/html/2606.14192#bib.bib48)), and PDiT (Mao et al., [2024](https://arxiv.org/html/2606.14192#bib.bib39)). Additionally, we include Behavior Cloning (BC) to serve as a fundamental regression benchmark.

Table [10](https://arxiv.org/html/2606.14192#A3.T10) presents the detailed numerical comparison on the AuctionNet dataset. Despite their architectural differences, all DT-style baselines suffer from mode collapse in multimodal auction landscapes. By augmenting them with DRIVE’s distributional head and retrieval mechanism, we observe a higher average reward for every backbone and gains in nearly every budget setting; the only exception is DT at the 75% budget, where the two results differ by less than one standard deviation. Most notably, the PDiT backbone achieves the largest improvement (+60.8 in average reward), indicating that our framework effectively complements sophisticated sequence models by providing explicit, high-quality historical anchors to rectify generative hallucinations.

Table 10: Evaluation of Generalization across Architectures. The DRIVE module is integrated into four distinct policy backbones (BC, PDiT, CDT, and DT) on the AuctionNet dataset. We report the total rewards (mean $\pm$ standard deviation) over 5 seeds.

| Dataset | Budget | BC | BC+DRIVE | PDiT | PDiT+DRIVE | CDT | CDT+DRIVE | DT | DT+DRIVE (ours) |
|---|---|---|---|---|---|---|---|---|---|
| AuctionNet | 50% | $201 \pm 1.30$ | $\bm{208 \pm 1.48}$ | $187 \pm 1.14$ | $\bm{207 \pm 0.79}$ | $208 \pm 2.06$ | $\bm{218 \pm 2.18}$ | $208 \pm 1.75$ | $\bm{212 \pm 1.57}$ |
| | 75% | $283 \pm 1.18$ | $\bm{287 \pm 2.64}$ | $267 \pm 4.47$ | $\bm{300 \pm 2.00}$ | $300 \pm 2.00$ | $\bm{304 \pm 2.03}$ | $298 \pm 2.21$ | $\bm{297 \pm 2.25}$ |
| | 100% | $358 \pm 2.72$ | $\bm{371 \pm 3.13}$ | $328 \pm 3.51$ | $\bm{387 \pm 4.94}$ | $382 \pm 3.19$ | $\bm{409 \pm 2.44}$ | $373 \pm 3.18$ | $\bm{399 \pm 3.74}$ |
| | 125% | $419 \pm 3.38$ | $\bm{440 \pm 4.44}$ | $381 \pm 1.29$ | $\bm{464 \pm 0.17}$ | $450 \pm 2.68$ | $\bm{488 \pm 3.54}$ | $430 \pm 2.98$ | $\bm{475 \pm 5.33}$ |
| | 150% | $471 \pm 5.11$ | $\bm{510 \pm 2.68}$ | $423 \pm 2.56$ | $\bm{532 \pm 4.94}$ | $508 \pm 2.67$ | $\bm{553 \pm 3.77}$ | $477 \pm 2.12$ | $\bm{551 \pm 4.64}$ |
| | Average | 346.3 | $\bm{363.2}$ ($\uparrow$16.9) | 317.2 | $\bm{378.0}$ ($\uparrow$60.8) | 369.6 | $\bm{394.4}$ ($\uparrow$24.8) | 357.2 | $\bm{386.6}$ ($\uparrow$29.4) |
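For a quick reading of Table 10, the reported average rewards can be turned into absolute and relative gains per backbone. The snippet below uses only the averages printed in the table; the relative percentages are derived here for convenience and are not reported in the paper.

```python
# Reported average total reward per backbone: (baseline, baseline + DRIVE)
reported_avg = {
    "BC": (346.3, 363.2),
    "PDiT": (317.2, 378.0),
    "CDT": (369.6, 394.4),
    "DT": (357.2, 386.6),
}


def gains(table):
    """Map each backbone to (absolute gain, relative gain in percent), rounded to 0.1."""
    return {
        name: (round(with_drive - base, 1), round(100.0 * (with_drive - base) / base, 1))
        for name, (base, with_drive) in table.items()
    }


result = gains(reported_avg)
largest = max(result, key=lambda name: result[name][0])
```

The absolute gains match the arrows in the table (16.9, 60.8, 24.8, 29.4), and PDiT shows the largest gain in both absolute and relative terms.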
### C.5 Effect of Constraint-Aware Training

We examine the necessity of aligning the critic’s objective with the specific constraints of the environment. Since the AuctionNet task imposes a CPA limit, we incorporate a corresponding penalty term into the critic and evaluate performance by analyzing the trade-off between the “Overall Score” and constraint compliance. In unconstrained settings, the critic would simply optimize the raw reward. However, as shown in Figure [10](https://arxiv.org/html/2606.14192#A3.F10), a naive critic trained solely on raw conversions fails to internalize the cost of violations, leading to a high CPA Exceed Rate (red line). In contrast, by injecting the constraint-specific penalty into the IQL objective, our method shapes the value landscape to reflect the true task goal. The critic learns to assign lower values to high-CPA actions, filtering out unsafe candidates during the selection phase. This confirms that for constrained tasks, the value estimation can be explicitly tailored to the active constraints to ensure safety and optimality.
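One way such a penalty could enter the critic's training signal is through a CPA-penalized reward. The exact penalty term used by the authors is not given in this section, so the form below is purely an assumption for illustration: conversions are scaled by $\min\{1, (C / \text{CPA})^{\beta}\}$, where $C$ is the CPA limit and $\beta$ is a hypothetical sharpness parameter.

```python
def cpa_penalized_reward(conversions, cost, cpa_limit, beta=2.0):
    """Scale conversions by a penalty that shrinks once realized CPA exceeds the limit.

    Assumed form (not taken from the paper): penalty = min(1, (cpa_limit / cpa) ** beta).
    """
    if conversions <= 0:
        return 0.0
    realized_cpa = cost / conversions
    if realized_cpa <= 0:
        return float(conversions)  # nothing spent, so the limit cannot be exceeded
    penalty = min(1.0, (cpa_limit / realized_cpa) ** beta)
    return penalty * conversions


compliant = cpa_penalized_reward(conversions=10, cost=80.0, cpa_limit=8.0)    # CPA = 8
violating = cpa_penalized_reward(conversions=10, cost=160.0, cpa_limit=8.0)   # CPA = 16
```

A critic trained on such a shaped signal would assign lower values to actions that buy the same conversions at twice the allowed cost, which is the filtering behavior described above.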

![Refer to caption](https://arxiv.org/html/2606.14192v1/x10.png)

Figure 10: Detailed Performance Analysis across Budgets. Comparison of Average Score (Left) and Risk/Exceed Rate (Right) under varying budget constraints. The constraint-aware critic (w/ CPA) consistently maintains high rewards while significantly reducing the risk of exceeding the CPA constraint compared to the baseline.
### C.6 Impact of Critic Architecture

To justify the choice of the critic module in DRIVE, we conduct an ablation study comparing our IQL-based critic with a CQL-based critic. Figure [11](https://arxiv.org/html/2606.14192#A3.F11) illustrates the total rewards under varying budget constraints.

In low-budget regimes, both critics yield comparable performance: the conservative nature of CQL is effective when the budget is tight, as avoiding overestimation is crucial for preventing budget depletion. As the budget increases, however, the IQL-based critic significantly outperforms the CQL variant, because the agent then requires a more accurate estimate of the upper quantiles of the return distribution to identify high-value opportunities. IQL, which performs expectile regression, avoids the under-estimation bias often observed in CQL, thereby providing better guidance for the retrieval and ranking process. Based on these results, we adopt IQL as the default value estimator to ensure robust performance across all budget levels.
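
The expectile regression mentioned above uses an asymmetric squared loss, $|\tau - \mathbb{1}(u < 0)|\,u^2$, which for $\tau > 0.5$ pulls the fitted value toward the upper part of the target distribution. The sketch below illustrates this on scalar samples; the helper names, the value of `tau`, and the sample returns are illustrative assumptions, not the paper's training code.

```python
def expectile_loss(diffs, tau=0.7):
    """Mean asymmetric squared loss, where each diff is target minus prediction."""
    total = 0.0
    for d in diffs:
        weight = tau if d >= 0 else 1.0 - tau  # under-predictions cost more when tau > 0.5
        total += weight * d * d
    return total / len(diffs)


def fit_expectile(samples, tau, iters=100):
    """Scalar minimizer of the expectile loss, found by iteratively reweighted averaging."""
    v = sum(samples) / len(samples)  # start from the mean (the tau = 0.5 expectile)
    for _ in range(iters):
        weights = [tau if x > v else 1.0 - tau for x in samples]
        v = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return v


returns = [0.0, 1.0, 2.0, 10.0]  # hypothetical returns with one high-value outcome
mean_value = fit_expectile(returns, tau=0.5)   # coincides with the plain mean
upper_value = fit_expectile(returns, tau=0.9)  # shifted toward the high-value outcome
```

With `tau = 0.5` the estimate reduces to the mean, while a larger `tau` moves it toward the best observed outcomes, which is the behavior the paragraph relies on at higher budgets.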

![Refer to caption](https://arxiv.org/html/2606.14192v1/x11.png)

Figure 11: Ablation study on Critic architecture. Performance comparison between the CQL-based critic (Blue) and the IQL-based critic (Orange) across varying budget constraints. The results indicate that while both architectures yield comparable performance in low-budget regimes, the IQL critic demonstrates superior scalability and robustness, consistently outperforming the CQL variant as the budget increases.
### C.7 Sensitivity to Sampling Size $L$

We investigate the impact of the number of sampled actions $L$ on performance, as shown in Figure [12](https://arxiv.org/html/2606.14192#A3.F12). Overall, increasing $L$ from 1 to 32 does not yield monotonic improvements for either variant. The performance gains from additional samples are marginal, and the curves remain relatively flat. This indicates that the GMM-based policy already produces reasonably good candidates even with a small sampling size. Across all sampling sizes, the retrieval-augmented variant consistently achieves higher average reward and exhibits smaller fluctuations than the pure generative baseline. These results suggest that, while the overall method is not highly sensitive to the exact choice of $L$, retrieval provides complementary high-quality candidates that stabilize performance and reduce the reliance on extensive sampling from the generative policy.
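
To make the role of the sampling size concrete, the following minimal sketch draws a fixed number of candidate actions from a one-dimensional Gaussian mixture and lets a critic pick the best one. The mixture parameters, the `toy_critic` function, and all names are illustrative assumptions; in the paper the policy head and the critic are learned networks.

```python
import random


def sample_gmm(weights, means, stds, num_samples, rng):
    """Draw num_samples scalar actions from a 1-D Gaussian mixture."""
    samples = []
    for _ in range(num_samples):
        # Pick a mixture component, then sample from its Gaussian.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples


def select_best(candidates, critic):
    """Return the candidate with the highest critic value."""
    return max(candidates, key=critic)


def toy_critic(action):
    """Hypothetical value estimate that prefers actions near 1.5."""
    return -(action - 1.5) ** 2


rng = random.Random(0)
# Two equally weighted modes; averaging them would give 1.0, which neither mode supports.
candidates = sample_gmm([0.5, 0.5], [0.5, 1.5], [0.05, 0.05], num_samples=8, rng=rng)
best = select_best(candidates, toy_critic)
```

Because each sample lands near one of the modes, a handful of candidates is often enough for the critic to find a good action, which is consistent with the flat curves reported above.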

![Refer to caption](https://arxiv.org/html/2606.14192v1/x12.png)

Figure 12: Sensitivity analysis of sampling size $L$. Average Total Reward as the number of generated candidate actions $L$ increases. Error bars denote the standard deviation over 5 seeds. While the performance is robust to variations in $L$, the retrieval-augmented variant consistently outperforms the baseline across all sampling budgets.
### C.8 Sensitivity to Number of Retrieved Neighbors $K$

We investigate the impact of the number of retrieved neighbors $K$ on performance, as visualized in Figure [13](https://arxiv.org/html/2606.14192#A3.F13). Unlike the sampling size analysis, the "only retrieval" baseline exhibits a strong sensitivity to $K$, showing monotonic performance improvements as $K$ increases from 1 to 9. This confirms that purely retrieval-based approaches rely heavily on a larger candidate pool to ensure the coverage of high-quality actions. In contrast, DRIVE demonstrates remarkable robustness, maintaining consistently superior performance across all tested $K$ values. Notably, our method achieves near-optimal rewards even with a minimal context of $K=1$, significantly surpassing the baseline's peak performance at $K=9$. These results indicate that the GMM-based policy effectively extracts and generalizes from sparse retrieval signals, decoupling the system's effectiveness from the strict quantity of retrieved neighbors and ensuring reliability even under constrained retrieval budgets.
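
A minimal sketch of how retrieved neighbors could enter the candidate pool is given below. It assumes that the actions of the nearest stored states are simply pooled with the sampled candidates before critic scoring, and it uses a brute-force Euclidean search as a stand-in for the FAISS index. The functions `retrieve_actions` and `drive_select`, the memory layout, and the toy critic are hypothetical, and the paper's exact mechanism may differ.

```python
import math


def retrieve_actions(query, memory, k):
    """Return the actions of the k stored states closest to query (Euclidean distance)."""
    ranked = sorted(memory, key=lambda item: math.dist(query, item[0]))
    return [action for _, action in ranked[:k]]


def drive_select(query, memory, sampled_actions, k, critic):
    """Pool sampled and retrieved actions, then keep the one the critic values most."""
    candidates = list(sampled_actions) + retrieve_actions(query, memory, k)
    return max(candidates, key=lambda action: critic(query, action))


# Hypothetical memory of (state, action) pairs taken from logged trajectories.
memory = [((0.0, 0.0), 1.0), ((1.0, 1.0), 2.0), ((5.0, 5.0), 9.0)]
state = (0.9, 0.9)


def toy_critic(s, a):
    """Hypothetical value estimate that prefers actions near 1.8."""
    return -abs(a - 1.8)


neighbors = retrieve_actions(state, memory, k=2)
chosen = drive_select(state, memory, sampled_actions=[0.5, 3.0], k=1, critic=toy_critic)
```

In this toy setting a single retrieved neighbor already supplies the best candidate, while a retrieval-only selector would depend entirely on how many neighbors it is allowed to inspect.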

![Refer to caption](https://arxiv.org/html/2606.14192v1/x13.png)

Figure 13: Sensitivity analysis of retrieved neighbor count $K$. Average Total Reward comparison between DRIVE and the retrieval-only baseline. Error bars denote standard deviation over 5 seeds. While the retrieval-only baseline relies on a larger $K$ for performance, DRIVE demonstrates superior robustness, maintaining high rewards even with minimal retrieval context ($K=1$).
### C.9 Computational Efficiency Analysis

In real-time bidding (RTB) systems, strict latency constraints (typically $<50$ ms or $<100$ ms) are imposed to ensure timely ad delivery. Therefore, it is crucial to verify that the performance gains of DRIVE do not come at the cost of prohibitive computational overhead. We analyze the computational cost from both training and inference perspectives.

Training Cost. The training cost of DRIVE remains comparable to standard baseline methods. The GMM-based policy head introduces only a negligible increase in parameters compared to the Transformer backbone. Furthermore, the construction of the retrieval index and the training of the offline critic are performed entirely offline. Consequently, they do not impose any additional burden on the online training loop or real-time infrastructure.

Inference Latency. To evaluate the real-time feasibility of DRIVE, we conducted a latency comparison against the standard Decision Transformer (DT) baseline. We measured the average wall-clock time per decision step over three independent runs. The DT baseline achieves an average latency of 10.44 ms. DRIVE, which incorporates distributional sampling, retrieval, and value evaluation, records an average latency of 46.38 ms. Although DRIVE introduces additional latency compared to the vanilla DT, the breakdown indicates that the combined time for candidate generation (including GMM sampling and retrieval) averages 39.01 ms, while the critic evaluation takes only 7.37 ms. Crucially, the total inference time consistently remains below 50 ms. This demonstrates that DRIVE strikes a favorable trade-off, delivering significant performance improvements while satisfying the strict latency requirements of industrial RTB systems.
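
The measurement protocol described above can be approximated with a small timing harness. The sketch below is an assumption about how such a measurement might be set up (the function `mean_latency_ms` and its parameters are hypothetical, not the authors' benchmark code). It also re-checks that the reported component times add up to the reported total.

```python
import time


def mean_latency_ms(step_fn, num_steps=100, num_runs=3):
    """Average wall-clock time per call of step_fn in milliseconds, averaged over runs."""
    per_run = []
    for _ in range(num_runs):
        start = time.perf_counter()
        for _ in range(num_steps):
            step_fn()  # one decision step of the policy under test
        elapsed = time.perf_counter() - start
        per_run.append(1000.0 * elapsed / num_steps)
    return sum(per_run) / len(per_run)


# Consistency check on the figures reported in the text.
candidate_generation_ms = 39.01  # GMM sampling plus retrieval
critic_evaluation_ms = 7.37
total_ms = candidate_generation_ms + critic_evaluation_ms
latency_budget_ms = 50.0
headroom_ms = latency_budget_ms - total_ms

# Example call with a placeholder step; a real benchmark would pass the policy's inference step.
placeholder_latency = mean_latency_ms(lambda: None, num_steps=10, num_runs=2)
```

The two reported components sum to the reported 46.38 ms, which leaves a few milliseconds of headroom under a 50 ms budget.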

Memory Usage. We conducted a comparative analysis of memory consumption between DRIVE and the DT baseline. The standard DT model requires a peak CPU RAM of 9.36 GB, which represents the base cost for data loading and environment simulation. In contrast, DRIVE records a peak CPU RAM of 28.94 GB. A detailed breakdown reveals that the majority of this overhead is explicitly allocated to the retrieval mechanism. Specifically, the pre-built FAISS index accounts for 13.33 GB (approx. 68% of the incremental memory). The remaining increase is attributed to auxiliary structures (e.g., cached RTG values and action candidates) and runtime buffers required for the retrieval-augmented inference. Given that modern industrial inference servers typically possess 64 GB to 256 GB of RAM, this memory footprint is well within acceptable limits for deployment.
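
The memory breakdown can be reproduced with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names and the 64 GB server size taken as the lower end of the quoted range are illustrative choices.

```python
dt_peak_gb = 9.36        # DT baseline peak CPU RAM
drive_peak_gb = 28.94    # DRIVE peak CPU RAM
faiss_index_gb = 13.33   # pre-built FAISS index

incremental_gb = drive_peak_gb - dt_peak_gb   # extra memory attributable to DRIVE
faiss_share = faiss_index_gb / incremental_gb  # fraction of the increment used by the index
other_gb = incremental_gb - faiss_index_gb     # auxiliary structures and runtime buffers

server_ram_gb = 64.0                           # smallest server size mentioned in the text
remaining_gb = server_ram_gb - drive_peak_gb   # memory left on such a server

print(f"incremental: {incremental_gb:.2f} GB, FAISS share: {faiss_share:.0%}, "
      f"other: {other_gb:.2f} GB, remaining on 64 GB: {remaining_gb:.2f} GB")
```

The increment is about 19.58 GB, of which the index takes roughly 68%, matching the percentage quoted in the text.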
