Generative Auto-Bidding with Unified Modeling and Exploration
Summary
This paper introduces Guide, a framework that combines a Decision Transformer with Q-value guidance and an inverse dynamics module to balance exploration and safety in automated bidding for digital advertising, demonstrating effectiveness on public datasets and simulated auctions.
# Generative Auto-Bidding with Unified Modeling and Exploration

Source: [https://arxiv.org/html/2605.19457](https://arxiv.org/html/2605.19457)

- Mingming Zhang, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University; Taobao & Tmall Group of Alibaba, Wuhan, China (mingmingzhang@whu.edu.cn)
- Feiqing Zhuang, Taobao & Tmall Group of Alibaba, Hangzhou, China (zhuangfeiqing.zfq@alibaba-inc.com)
- Na Li, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University; Taobao & Tmall Group of Alibaba, Wuhan, China (wannal@whu.edu.cn)
- Shengjie Sun, Taobao & Tmall Group of Alibaba, Hangzhou, China (shengjie.ssj@alibaba-inc.com)
- Xiaowei Chen, Taobao & Tmall Group of Alibaba, Hangzhou, China (qisheng.cxw@alibaba-inc.com)
- Junxiong Zhu, Taobao & Tmall Group of Alibaba, Hangzhou, China (xike.zjx@alibaba-inc.com)
- Fei Xiao, Taobao & Tmall Group of Alibaba, Hangzhou, China (guren.xf@alibaba-inc.com)
- Keping Yang, Taobao & Tmall Group of Alibaba, Hangzhou, China (shaoyao@alibaba-inc.com)
- Lixin Zou, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China (zoulixin@whu.edu.cn)
- Chenliang Li, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China (cllee@whu.edu.cn)

(2026)

###### Abstract

Automated bidding has become a core component of modern digital advertising. Early methods were primarily rule-based; while easy to implement, they struggled to adapt to rapidly changing environments. Subsequent Reinforcement Learning methods modeled bidding as a Markov Decision Process but were limited in their ability to capture long-term dependencies. While recent generative models have shown encouraging progress, they generally lack explicit mechanisms to balance exploration and safety. They often rely solely on simple action perturbations or trajectory guidance to foster bidding exploration, and critically, they lack a safety fallback mechanism. This limitation leads to inefficient exploration and significantly increases the financial risk for advertising platforms. To bridge this gap, we propose a new framework named Generative Auto-Bidding with Unified Modeling and Exploration (Guide), which synergistically integrates directed exploration with a safe fallback mechanism. Guide utilizes a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration through regularization constraints. Concurrently, an Inverse Dynamics Module (IDM) leverages the future states predicted by the DT to infer robust and behaviorally consistent actions, thereby providing a safe policy fallback. The Q-value module then adaptively selects the final action from these two options, balancing exploration and safety. Together, these three components form an integrated "explore–safeguard–select" pipeline, unifying efficiency and safety.
We conduct comprehensive experiments on public datasets, in simulated auction environments, and through a large-scale online deployment on Taobao, a leading advertising platform in China. The results demonstrate that Guide consistently outperforms state-of-the-art (SOTA) baseline methods across all scenarios. In real-world online deployment, Guide achieves the following improvements: +4.10% in ad GMV, +1.40% in ad clicks, +1.66% in ad cost, and +3.52% in ad ROI, demonstrating its effectiveness and strong industrial applicability.

Keywords: Auto Bidding, Generative Decision Model, Decision Transformer

Conference: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20–24, 2026, Melbourne, VIC, Australia. DOI: 3805712.3809661. ISBN: 979-8-4007-2599-9/2026/07. CCS concepts: Applied computing → Online auctions; Information systems → Computational advertising.

## 1. Introduction

With the rapid evolution of the digital advertising ecosystem, the global online advertising market reached the hundred-billion-dollar scale in 2025, and traditional manual ad bidding methods struggle to meet the demands for real-time response and large-scale optimization (Borissov et al., [2010](https://arxiv.org/html/2605.19457#bib.bib36); Wen et al., [2022](https://arxiv.org/html/2605.19457#bib.bib37)).
Automated bidding technology not only enhances the efficiency of ad delivery but also allows for more precise budget allocation and resource management according to different marketing objectives, such as clicks, conversions, or return on investment (Zhang et al., [2014](https://arxiv.org/html/2605.19457#bib.bib29); Li et al., [2024](https://arxiv.org/html/2605.19457#bib.bib30); Liu et al., [2020](https://arxiv.org/html/2605.19457#bib.bib31); Yuan et al., [2022](https://arxiv.org/html/2605.19457#bib.bib35); Zhang et al., [2023](https://arxiv.org/html/2605.19457#bib.bib39); Yuan et al., [2013](https://arxiv.org/html/2605.19457#bib.bib18); Li and Tang, [2022](https://arxiv.org/html/2605.19457#bib.bib32)). Its growing importance in improving advertising effectiveness and reducing operational costs has made it one of the core tools in contemporary advertising strategies.

Figure 1. Different modeling approaches in ad bidding. $a_t$ and $s_t$ denote the actions and states, respectively, and $\hat{a}_t^{*}$ denotes the better action. Q denotes the Q-value module.

Early automated bidding approaches often relied on rule-based strategies, such as PID control. Although these methods are easy to implement, they lack the ability to adapt to dynamic advertising environments. To address these limitations, reinforcement learning has been widely applied to automated bidding tasks, modeling them as a Markov Decision Process (MDP) (Puterman, [2014](https://arxiv.org/html/2605.19457#bib.bib40); Boutilier et al., [1999](https://arxiv.org/html/2605.19457#bib.bib41)), where the advertiser's bidding behavior in each auction is treated as a decision action, and optimal actions are chosen based on the environmental state, such as user characteristics, accumulated rewards, and market dynamics (Fujimoto et al., [2019](https://arxiv.org/html/2605.19457#bib.bib33); Kostrikov et al., [2021](https://arxiv.org/html/2605.19457#bib.bib8)).
However, since MDPs consider only the current state and action, they often fail to adequately capture complex temporal dependencies and dynamics of the advertising environment, making it difficult to make accurate decisions when faced with long\-term dependencies and complex user behavior patterns\(Caiet al\.,[2017](https://arxiv.org/html/2605.19457#bib.bib19)\)\. Recently, generative models have demonstrated the ability to effectively model complex historical dependencies and are able to discover improved bidding strategies, making them a focal point of current research\(Jianget al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib38); Gaoet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib16); Liet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib15); Guoet al\.,[2024](https://arxiv.org/html/2605.19457#bib.bib13)\)\. Models based on Decision Transformer \(DT\)\(Chenet al\.,[2021](https://arxiv.org/html/2605.19457#bib.bib12)\), such as GAS\(Liet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib15)\)and GAVE\(Gaoet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib16)\), model the sequence of bidding actions, while those based on Decision Diffusion \(DD\)\(Luet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib34)\), such as AIGB\(Guoet al\.,[2024](https://arxiv.org/html/2605.19457#bib.bib13)\)and EGDB\(Penget al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib14)\), model the sequence of advertising environment states\. These models have achieved impressive results by integrating thoughtfully designed exploration strategies\. However, underlying these advancements lies a fundamental challenge: how to encourage exploration while ensuring the financial safety of the platform\. It is well known that there is an inherent tension between exploration and reliability\. In the high\-stakes environment of ad auctions, where every cent is critical, unconstrained exploration is tantamount to gambling\. 
While existing approaches promote exploration through techniques such as action perturbation and value guidance, they generally lack an explicit safety fallback mechanism. When the model explores unknown or perilous regions of the policy space, the system is unable to revert to a known, robust baseline policy, rendering the exploration process not only inefficient but also exceptionally risky. This leads to a critical open question: *how can we design a unified framework that synergistically integrates directed, efficient exploration with a robust safety fallback mechanism, thereby achieving both high performance and operational reliability?*

To address this challenge, we propose Generative Auto-Bidding with Unified Modeling and Exploration (denoted as Guide), a unified framework that integrates exploration effectiveness and safety. As illustrated in Figure [1](https://arxiv.org/html/2605.19457#S1.F1), Guide jointly models environmental dynamics and historical bidding action sequences, complemented by a Q-value-based action optimization and selection module to balance exploration and safety. Specifically, we employ the DT as the backbone network to simultaneously generate future state trajectories and candidate bidding action sequences, thereby gaining a deeper understanding of the current bidding environment. To enable directed exploration, we integrate a Q-value module that guides the exploration direction of the DT through regularization constraints. Meanwhile, we introduce an Inverse Dynamics Module (IDM), which leverages the DT-predicted future states to infer plausible bidding actions from the transition between the current and predicted states. By design, the DT boldly explores potentially high-reward strategies, while the IDM effectively imitates the behavioral policy embedded in the training data, producing safer and more stable actions that serve as a reliable fallback during high-risk exploration.
The Q-value module further adaptively selects actions between those proposed by the DT and the IDM, ensuring a balanced trade-off between exploration and safety. These three components work in concert to enable efficient bidding exploration with safety guarantees, leading to smarter and more robust automated bidding. We also adopt a two-stage training procedure for stable and efficient model learning.

We conduct comprehensive evaluations of Guide on public offline datasets and simulated advertising auction environments, and the results show that Guide significantly outperforms existing state-of-the-art baseline methods in all settings. Furthermore, we deployed Guide on Taobao, one of the largest e-commerce platforms in China, and achieved improvements of 4.10% in ad GMV, 1.40% in ad clicks, 1.66% in ad cost, and 3.52% in ad ROI, demonstrating the effectiveness and leading performance of Guide in the field of automated bidding.

In summary, our contributions are as follows:

- We propose the first unified modeling paradigm that jointly captures environmental dynamics and bidding actions within a single generative framework, simultaneously modeling the evolution of the advertising environment and the sequence of historical bids. This design significantly enhances understanding of complex, dynamic ad ecosystems for policy optimization.
- We propose a novel bidding mechanism that integrates "Exploration–Guarantee–Selection", which effectively resolves the fundamental tension between exploration and safety in high-risk advertising scenarios by organically combining three core components: an active exploration module based on Decision Transformer, a safety fallback module grounded in IDM, and a Q-value-based action selector.
- We conduct comprehensive experiments on public offline datasets, simulated scenarios, and real-world business environments.
The results show that our proposed Guide significantly outperforms existing state-of-the-art auto-bidding baselines across all metrics and settings.

## 2. Related Works

In the domain of online advertising, automated bidding methods have evolved into four main categories: PID control, reinforcement learning, generative models, and LLM-based agents. Early bidding methods, which were based on PID control theory (Chen et al., [2011](https://arxiv.org/html/2605.19457#bib.bib1); Yang et al., [2019](https://arxiv.org/html/2605.19457#bib.bib2); Zhang et al., [2016](https://arxiv.org/html/2605.19457#bib.bib3); Knospe, [2006](https://arxiv.org/html/2605.19457#bib.bib4); Borase et al., [2021](https://arxiv.org/html/2605.19457#bib.bib5)), suffered from several practical issues, most notably a heavy reliance on meticulous parameter tuning and a limited capacity for adapting to dynamic market environments. To address these inherent limitations, the research community turned its attention to reinforcement learning, giving rise to more advanced bidding algorithms such as USCB (He et al., [2021](https://arxiv.org/html/2605.19457#bib.bib6)) and SORL (Mou et al., [2022](https://arxiv.org/html/2605.19457#bib.bib7)). These algorithms leverage fundamental reinforcement learning techniques like IQL (Kostrikov et al., [2021](https://arxiv.org/html/2605.19457#bib.bib8)) and CQL (Kumar et al., [2020](https://arxiv.org/html/2605.19457#bib.bib9)) to learn behavioral policies from ad log datasets, enabling fully automated bidding. However, they remain relatively inefficient at fully utilizing the rich historical information in the logs. Subsequently, generative approaches were introduced, which innovatively reframe the ad bidding task as a sequence generation problem.
Generative efforts can be further divided into two categories: those based on Decision Diffusion\(Ajayet al\.,[2022](https://arxiv.org/html/2605.19457#bib.bib10); Zhuet al\.,[2024](https://arxiv.org/html/2605.19457#bib.bib11)\)and those based on Decision Transformer\(Chenet al\.,[2021](https://arxiv.org/html/2605.19457#bib.bib12)\)\. AIGB\(Guoet al\.,[2024](https://arxiv.org/html/2605.19457#bib.bib13)\)pioneered a new generative bidding paradigm, using Decision Diffusion to model ad status sequences and an inverse dynamics model to generate actions\. Then EGDB\(Penget al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib14)\)introduced expert information to optimize the generated trajectories\. GAS\(Liet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib15)\)and GAVE\(Gaoet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib16)\), on the other hand, are based on Decision Transformer\(Chenet al\.,[2021](https://arxiv.org/html/2605.19457#bib.bib12)\)networks to generate bidding actions\. They designed Monte Carlo post\-training search strategy and value\-guided action generation optimization to enhance exploration respectively\. However, none of the aforementioned methods explicitly consider the trade\-off between exploration and safety\. With the advancement of LLMs, recent research such as RTBAgent\(Caiet al\.,[2025](https://arxiv.org/html/2605.19457#bib.bib17)\)has leveraged the planning capabilities of LLM agents to fine\-tune the actions of basic bidding models\. However, it still faces challenges such as excessive latency and low reproducibility\. Meanwhile, developing high\-fidelity advertising simulation environments is widely recognized as crucial for improving automated bidding strategies\. 
This importance stems from their unique capability to bridge the often-substantial gap between analyses conducted on static offline data and the dynamic, unpredictable nature of real-world auction scenarios (Yuan et al., [2013](https://arxiv.org/html/2605.19457#bib.bib18); Cai et al., [2017](https://arxiv.org/html/2605.19457#bib.bib19); Jin et al., [2018](https://arxiv.org/html/2605.19457#bib.bib20)). AuctionNet (Su et al., [2024](https://arxiv.org/html/2605.19457#bib.bib22)) introduced the first large-scale advertising simulation framework with a traffic generator that replicates real-world distributions. More recently, the Bid2X (Ji et al., [2025](https://arxiv.org/html/2605.19457#bib.bib21)) project advanced this line of research by training the first large-scale, Transformer-based environment model. By leveraging a wide array of diverse advertising datasets, this work not only created a powerful simulator but also made the significant demonstration that its performance scales in accordance with the well-established scaling laws (Kaplan et al., [2020](https://arxiv.org/html/2605.19457#bib.bib23)), a principle newly verified within the advertising domain.

## 3. Preliminary

### 3.1. Definition of Auto-Bidding Problem

#### 3.1.1. Problem Setting

The goal of auto-bidding is to determine a bidding sequence that maximizes the total value of acquired traffic $v_i$, subject to budget and CPA constraints. Let $x_i \in \{0,1\}$ indicate if impression $i$ is won.
The optimization problem is:

$$\max_{\{b_i\}_{i=1}^{I}} \sum_{i=1}^{I} x_i v_i \tag{1}$$

subject to:

- Budget Constraint: Total spend should not exceed the budget $B$:

  $$\sum_{i=1}^{I} x_i c_i \leq B \tag{2}$$

- CPA Constraint: The average cost per acquisition (CPA) should not exceed threshold $C$:

  $$CPA = \frac{\sum_{i=1}^{I} x_i c_i}{\sum_{i=1}^{I} x_i v_i} \leq C \tag{3}$$

where $c_i$ is the actual cost incurred for winning the $i$-th impression, and $v_i$ is the value generated by that impression. The budget constraint is strictly enforced, while CPA constraints are typically soft, as they are evaluable only after the campaign ends.

#### 3.1.2. Optimal Bidding Policy

The optimal bid can then be expressed, via the complementary slackness theorem (Dantzig, [2016](https://arxiv.org/html/2605.19457#bib.bib42)), as a function of its value and the CPA threshold (Yu et al., [2017](https://arxiv.org/html/2605.19457#bib.bib25)):

$$b_i^{*} = (\lambda_0^{*} + \lambda_1^{*} C)\, v_i = \lambda^{*} v_i \tag{4}$$

where $\lambda_0^{*}$, $\lambda_1^{*}$ are coefficients determined by the campaign's budget and CPA requirements. This formulation allows the bidding strategy to effectively balance value maximization and constraint satisfaction in dynamic auction environments.

### 3.2. Sequence Modeling for Auto-bidding Problem

While the bidding policy $b_i = \lambda^{*} v_i$ provides a theoretically sound structure, the assumption of a single, static multiplier $\lambda^{*}$ is insufficient for real-world dynamic auction environments. Market conditions like competitors' bids and impression availability keep changing, so a fixed $\lambda^{*}$ is no longer optimal over time.
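As a rough illustration of the static policy in Eq. (4), the following Python sketch applies one fixed multiplier to a handful of impressions and reports total value, spend, and CPA. It is not taken from the paper: the `Impression` class, the `run_fixed_multiplier` helper, the second-price cost rule, and the choice to skip any impression that would exceed the budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Impression:
    value: float         # v_i, value of the impression
    market_price: float  # assumed highest competing bid (hypothetical field)


def run_fixed_multiplier(
    impressions: List[Impression], lam: float, budget: float
) -> Tuple[float, float, float]:
    """Bid b_i = lam * v_i on every impression; return (value, spend, cpa)."""
    total_value = 0.0
    spend = 0.0
    for imp in impressions:
        bid = lam * imp.value
        cost = imp.market_price   # c_i under the assumed second-price rule
        wins = bid >= cost        # x_i = 1 when the bid clears the market price
        # Assumption: the budget constraint of Eq. (2) is enforced per impression.
        if wins and spend + cost <= budget:
            total_value += imp.value
            spend += cost
    # CPA as defined in Eq. (3); reported as 0.0 when nothing was won.
    cpa = spend / total_value if total_value > 0 else 0.0
    return total_value, spend, cpa


# Toy data, invented for this example.
toy = [
    Impression(1.0, 0.5),
    Impression(2.0, 3.0),
    Impression(0.5, 0.25),
    Impression(1.5, 1.0),
]
for lam in (0.4, 1.0):
    print(lam, run_fixed_multiplier(toy, lam=lam, budget=2.0))
```

In this toy setting a multiplier that is too low wins nothing, while a larger one wins more traffic until the budget binds, which is the kind of sensitivity that motivates adjusting the multiplier over time.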
A more powerful strategy is to adapt the multiplier at each time step, $\lambda_t$, based on the evolving campaign status and market feedback. This need for dynamic adjustment naturally casts the auto-bidding problem as a *sequential decision process*, enabling the application of modern sequence modeling approaches such as the Decision Transformer. This paradigm reformulates the problem as conditional sequence modeling. The goal is to learn a model capable of generating high-return trajectories. Specifically, we aim to model the conditional probability of an action $a_t$, conditioned on the past history of states, actions, and rewards, as well as a desired future performance target. The main components of this sequential formulation are:

- State $s_t$: feature vector summarizing the bidding environment at time $t$ (e.g., remaining budget, time, and previous results).
- Action $a_t$: adjustable bidding parameter for time $t$ (e.g., $a_t = \lambda_t$).
- Reward $r_t$: total value of the impressions won at time $t$ among the $N_t$ candidate impressions, $r_t = \sum_{n=1}^{N_t} x_n v_n$.
- Return-to-go $R_t$: cumulative rewards from $t$ to $T$, $R_t = \sum_{t'=t}^{T} r_{t'}$.

The bidding process forms a trajectory $\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, suitable for sequence modeling approaches such as transformer or diffusion architectures. This enables flexible and adaptive policy optimization in dynamic ad environments.

## 4. Method

Here, we first introduce the design details of the two basic modules of our proposed Guide, the DT and the IDM, as well as the two-stage training procedure. Then, we present the design of the Q-value module. Next, we describe the mechanism for action selection using the Q-value module. Finally, we summarize the respective roles of the DT and the IDM.
The overall architecture is shown in Figure [2](https://arxiv.org/html/2605.19457#S4.F2).

Figure 2. Overview architecture. a) Training of the unified modeling framework. b) Inference with bid selection.

### 4.1. Unified Modeling of Bid Trajectories

#### 4.1.1. Trajectory Construction and Modeling

In the auto-bidding task, each round of bidding can be represented as a temporal trajectory that sequentially records the advertising environment states, bidding actions, and the resulting rewards. Formally, a trajectory can be represented as follows:

$$\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T) \tag{5}$$

where $s_t$ denotes the environment state at time step $t$, and $a_t$ and $r_t$ are the bidding action taken and the immediate reward received, respectively, at the same step. To effectively capture historic information and long-term dependencies, we adopt the Decision Transformer for historical sequence modeling. At each time step, the DT module takes in the historical states, actions, and return-to-go as input features, and predicts the next action and next state in one go. Specifically, at time $t$, the model makes the prediction as follows:

$$(\hat{a}_t, \hat{s}_{t+1}) \sim DT(R_{t-k+1}, s_{t-k+1}, a_{t-k+1}, \ldots, R_t, s_t) \tag{6}$$

Different from the existing models like GAS (Li et al., [2025](https://arxiv.org/html/2605.19457#bib.bib15)) and GAVE (Gao et al., [2025](https://arxiv.org/html/2605.19457#bib.bib16)), we choose to jointly generate the next action $\hat{a}_t$ and the subsequent environment state $\hat{s}_{t+1}$. This treatment could exploit more supervision signals for DT module training, explicitly guiding the model to capture high-order state evolution.
Meanwhile, as described in the following, the estimated $\hat{s}_{t+1}$ works as the pivot to enable better modeling of transient state transitions by the IDM.

#### 4.1.2. Inverse Dynamics Module

After modeling the historical signals, we further incorporate an inverse dynamics module to infer the actions over the transient state transitions. This design provides an alternative pathway for action generation, enhancing the diversity and robustness of the policy. The IDM operates as follows: given two consecutive environment states $s_t$ and $\hat{s}_{t+1}$, it estimates the action $\hat{a}_t^{idm}$ that could have led to this state transition. The module is implemented as a neural network $f_{idm}$, typically parameterized as a multilayer perceptron (MLP), which learns the inverse mapping:

$$\hat{a}_t^{idm} = f_{idm}(s_t, \hat{s}_{t+1}) \tag{7}$$

Here, to ensure consistency between training and inference, the input to the inverse dynamics model is the current state $s_t$ and the next state $\hat{s}_{t+1}$ predicted by DT. During training, $f_{idm}$ is supervised to minimize the mean squared error between the inferred action and the true action recorded in the dataset. Specifically, the loss function for the IDM is calculated as follows:

$$\mathcal{L}_{idm} = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \left\| f_{idm}(s_t, \hat{s}_{t+1}) - a_t \right\|^2 \right] \tag{8}$$

where $\mathcal{D}$ represents the training dataset. This objective encourages the IDM to output action predictions that closely align with the actual actions observed in the real trajectories. Beyond providing an alternative action source, this design serves a more profound purpose: it implicitly regularizes the state prediction of the Decision Transformer.
By tasking the Inverse Dynamics Model with inferring a plausible action from the transition $(s_t, \hat{s}_{t+1})$, we force the DT to generate a future state $\hat{s}_{t+1}$ that is physically reachable from the current state $s_t$. If the DT hallucinates a future state that is inconsistent with the environment's underlying dynamics, the IDM will struggle to reconstruct the correct action, leading to a higher loss that backpropagates to the DT. This feedback loop encourages the DT to learn a more realistic model of environmental evolution, thereby grounding its long-term sequence generation in plausible, moment-to-moment state transitions.

#### 4.1.3. Two-Stage Training

Our training procedure adopts a two-stage paradigm to facilitate stable and efficient joint learning of DT and IDM. Specifically, the process is divided into two phases:

Phase 1: Separate Training. During the initial phase, DT and the IDM are optimized independently. Gradients from the IDM are prevented from propagating into the DT by detaching the predicted next state when computing the IDM loss. Given the current state $s_t$ and the detached DT-predicted next state $\hat{s}_{t+1}$, the IDM is trained with the following objective:

$$\mathcal{L}_{idm}' = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \left\| f_{idm}(s_t, \mathrm{stop\_grad}(\hat{s}_{t+1})) - a_t \right\|^2 \right] \tag{9}$$

where $f_{idm}$ denotes the inverse dynamics model. The DT is updated separately to minimize a behavior cloning loss for actions and a state prediction loss:

$$\mathcal{L}_{dt} = \mathbb{E} \left[ (\hat{a}_t - a_t)^2 + (\hat{s}_{t+1} - s_{t+1})^2 \right] \tag{10}$$

Phase 2: Joint Training.
After the separate pre-training, both DT and IDM are trained jointly. The IDM loss is incorporated into the total objective for DT, and gradients are allowed to flow through both networks. The training objective becomes:

$$\mathcal{L} = \mathcal{L}_{dt} + \mathcal{L}_{idm} \tag{11}$$

In this phase, the IDM takes the non-detached DT-predicted next state $\hat{s}_{t+1}$ as input. Compared with single-stage training, this two-phase pipeline prevents unstable gradient propagation in the early stage and enables the IDM to extract high-quality inverse dynamics features. By incorporating inverse dynamics supervision, this approach improves both action generation accuracy and trajectory consistency, helping the DT model learn robust action-state correspondence, especially under complex environment transition scenarios.

### 4.2. Q-value-based optimization

Relying solely on supervised learning from the dataset can only lead to suboptimal behavioral policies, since it restricts exploration and prevents the discovery of better policies. To encourage effective policy exploration within the proposed unified modeling framework, we introduce a Q-value prediction module, implemented as a twin-critic neural network.

#### 4.2.1. Twin Q Networks and Target Networks Architecture

Our Q-value estimation module adopts a twin critic architecture, which consists of two independent Q networks, denoted as $Q_1$ and $Q_2$ (Fujimoto and Gu, [2021](https://arxiv.org/html/2605.19457#bib.bib27)). Each network takes a state-action pair $(s_t, a_t)$ as input and outputs the predicted cumulative expected return for taking action $a_t$ in state $s_t$. In addition, each Q network has a corresponding target Q network, namely $Q_1^{\text{target}}$ and $Q_2^{\text{target}}$.
The parameters of the target networks are updated via an exponential moving average of the main network parameters, which helps stabilize training. This twin Q network structure alleviates overestimation bias by using $\min(Q_1, Q_2)$ for conservative value estimation and target calculation.

#### 4.2.2. Critic Training Procedure

During training, transitions $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ are sampled from the replay buffer together with a termination flag $d_t$, where $r_t$ is the reward and $d_t$ indicates whether the episode ends. At each update step, we first compute the temporal difference (TD) target $y_t$ using the target networks as follows:

$$y_t = r_t + \gamma (1 - d_t) \min \left\{ Q_1^{\text{target}}(s_{t+1}, a_{t+1}),\; Q_2^{\text{target}}(s_{t+1}, a_{t+1}) \right\}, \tag{12}$$

where $\gamma$ is the discount factor. The current Q networks $Q_1$ and $Q_2$ estimate the value for the sampled state-action pairs, and the critic loss is computed as the sum of mean squared errors:

$$\mathcal{L}_{critic} = \mathbb{E} \left[ \left( Q_1(s_t, a_t) - y_t \right)^2 + \left( Q_2(s_t, a_t) - y_t \right)^2 \right]. \tag{13}$$
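The critic update in Eqs. (12) and (13), together with the exponential-moving-average target update, can be sketched in plain Python as follows. This is a toy sketch rather than the authors' implementation: critics are modeled as ordinary callables on scalar states and actions, parameters as a dictionary of floats, and the default values of `gamma` and `tau` are assumptions, since this section does not state them.

```python
from typing import Callable, Dict

# Toy critic type: Q(s, a) for scalar state and action (an illustrative simplification).
Critic = Callable[[float, float], float]


def td_target(
    r: float,
    d: float,
    s_next: float,
    a_next: float,
    q1_target: Critic,
    q2_target: Critic,
    gamma: float = 0.99,  # assumed discount factor
) -> float:
    """Eq. (12): y = r + gamma * (1 - d) * min(Q1_target, Q2_target)."""
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - d) * q_min


def critic_loss(q1_pred: float, q2_pred: float, y: float) -> float:
    """Per-sample term of Eq. (13): squared TD errors of both critics."""
    return (q1_pred - y) ** 2 + (q2_pred - y) ** 2


def ema_update(
    target: Dict[str, float], online: Dict[str, float], tau: float = 0.005
) -> None:
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    for name, value in online.items():
        target[name] = (1.0 - tau) * target[name] + tau * value
```

Taking the minimum over the two target critics is what makes the target conservative: if either critic overestimates the value of $(s_{t+1}, a_{t+1})$, the other one caps it.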
#### 4.2.3. Q-Optimized Actor Training

By leveraging the critic model, we optimize the training of the actor modules (*i.e.,* DT and IDM) through the incorporation of a Q-value regularization term into their loss functions:

$$\mathcal{L}_{actor} = \mathcal{L}_{dt} + \mathcal{L}_{idm} + \mathbb{E}_{s} \left[ -\min(Q_1(s, \hat{a}), Q_2(s, \hat{a})) \right]. \tag{14}$$

The negative sign in the regularization term incentivizes the actor to generate actions associated with higher Q-values, thereby favoring behaviors that are expected to yield greater returns. This composite loss function strikes a balance between learning behavioral policies from offline data and exploring novel policies with improved performance.

### 4.3. Q-value Based Action Selection at Inference

During inference, our framework enables the generation of two candidate actions, based on both the Decision Transformer and the inverse dynamics module. The Q-value prediction module evaluates each candidate action and selects the one with the highest estimated Q-value:

$$\begin{aligned} Q_{\text{dt}} &= \min \left\{ Q_1(s, \hat{a}),\; Q_2(s, \hat{a}) \right\}, \\ Q_{\text{idm}} &= \min \left\{ Q_1(s, \hat{a}^{\text{idm}}),\; Q_2(s, \hat{a}^{\text{idm}}) \right\}, \\ a^{*} &= \arg\max \left\{ Q_{\text{dt}},\; Q_{\text{idm}} \right\}. \end{aligned} \tag{15}$$
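A minimal sketch of the selection rule in Eq. (15) is given below, again with critics modeled as plain callables. The `select_action` name, the returned source label, and the tie-break in favor of the DT candidate are assumptions for illustration and are not specified in the text.

```python
from typing import Callable, Tuple

Critic = Callable[[float, float], float]  # toy Q(s, a) on scalars


def select_action(
    state: float, a_dt: float, a_idm: float, q1: Critic, q2: Critic
) -> Tuple[float, str]:
    """Score both candidates with min(Q1, Q2) and keep the higher-scoring one."""
    q_dt = min(q1(state, a_dt), q2(state, a_dt))
    q_idm = min(q1(state, a_idm), q2(state, a_idm))
    # Assumption: ties go to the DT (exploratory) candidate.
    if q_dt >= q_idm:
        return a_dt, "dt"
    return a_idm, "idm"


# Hypothetical critics that peak at slightly different actions.
q1 = lambda s, a: -(a - 1.0) ** 2
q2 = lambda s, a: -(a - 1.5) ** 2
print(select_action(0.0, a_dt=2.0, a_idm=1.0, q1=q1, q2=q2))
```

Because both candidates are scored by the same conservative estimate, the IDM action is chosen whenever the DT proposal drifts into a region that either critic values poorly.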
By integrating the Q-value prediction module, Guide provides a principled, value-driven, and flexible decision-making mechanism for selecting among actions estimated from different perspectives, ensuring robust policy deployment and adaptability in dynamic or even unseen advertising environments, as demonstrated by our experimental results.

### 4.4. Summary

Within the proposed unified modeling architecture, the Q-value optimization module incorporates Q-value regularization into the DT loss, explicitly guiding the model to generate high-value actions and thereby enhancing its exploration capability. Compared with a traditional DT trained solely via behavior cloning, this mechanism is more effective at discovering high-quality out-of-distribution trajectories. Therefore, the generated $\hat{a}_t$ could be an effective exploration during the training process, where $\hat{s}_{t+1}$ is the corresponding future state predicted by the DT as well. In these cases, $\hat{a}_t$ would not be equal to the ground truth $a_t$ but leads to a better Q value instead. Note that the Q-value module is not involved in IDM training. The IDM reconstructs ground-truth actions based solely on the estimated state transition patterns. Hence, the IDM aims to memorize the transition patterns that lead to good explorations generated by the DT. That is to say, the IDM tends to imitate the behavioral policy embedded in the dataset, working as a reliable fallback when exploratory actions may be suboptimal or risky.

## 5. Offline Experiment

### 5.1. Experimental Setting

In this section, we conduct detailed offline experiments to answer the following questions:

- RQ1: Does Guide perform better than other baseline methods across different testing environments?
- RQ2: How does each design choice contribute to the overall performance?
- RQ3: How do DT and IDM cooperate to improve bidding actions?
We first describe the experimental settings in the following. The code implementation has been released at https://github.com/M2C-Tech/GUIDE.

#### 5.1.1. Datasets

We use AuctionNet (Su et al., [2024](https://arxiv.org/html/2605.19457#bib.bib22)), a large-scale benchmark proposed by Alibaba, for our evaluation. AuctionNet is the official dataset and simulation environment for the NeurIPS 2024 Advertising Bidding Competition. It is designed to model real-world advertising auto-bidding scenarios and consists of two main components: an *offline dataset* and a dynamic *simulation environment*.

The offline dataset simulates competition among 48 different advertisers over multiple advertising periods. Each period contains approximately 500,000 impression opportunities and is divided into 48 decision steps. The data is provided in two formats: (1) traffic-level data, which offers granular records for each impression, including features like predicted conversion probability (pValue), bid, cost, and win status; and (2) trajectory-level data, which aggregates the information into an RL-style format with states, actions, and rewards for each advertiser at each decision step. To ensure our evaluation is rigorous and reflects advanced real-world challenges, we specifically use the *final-round* AuctionNet dataset, which is characterized by greater data sparsity and higher difficulty compared to the preliminary-round version.

For a more comprehensive and dynamic assessment, we also employ the official simulation environment (Su et al., [2024](https://arxiv.org/html/2605.19457#bib.bib22)). This environment faithfully replicates the multi-agent competitive dynamics of an industrial-scale advertising platform. In each evaluation, our proposed agent controls one advertiser and competes against the other 47 advertisers, which are operated by a diverse set of strong official baseline agents.
These baselines span various algorithms, including PID controllers, Online Linear Programming, Offline Reinforcement Learning, and Decision Transformers, creating a heterogeneous and challenging competitive landscape. To ensure fair and robust assessment, the evaluation protocol requires each submitted strategy to sequentially control all 48 advertisers over multiple delivery periods, with the final performance being an aggregation of all results. This two-pronged evaluation approach allows for a thorough and realistic assessment of our method's performance.

#### 5.1.2. Metrics

To assess the efficacy of our model, we utilize a performance metric termed the advertising bidding score (Su et al., [2024](https://arxiv.org/html/2605.19457#bib.bib22)). This score quantifies the agent's proficiency in maximizing conversions while strictly adhering to the advertiser's predefined Cost Per Acquisition (CPA) constraint, denoted as $C$. A penalty is applied if the actual CPA surpasses $C$. The score is formally defined as:

$$Score = \mathbb{P}(CPA; C) \cdot \sum_{i} x_i \cdot v_i \tag{16}$$

Here, the penalty function, which comes into effect when the actual CPA exceeds the constraint $C$, is given by:

$$\mathbb{P}(CPA; C) = \min\left\{ \left( \frac{C}{CPA} \right)^{\beta},\ 1 \right\} \tag{17}$$

where $\beta$ is a positive hyperparameter (commonly set to $\beta = 2$). This penalty is specifically enforced only when $CPA > C$. This metric enables a robust evaluation of the bidding agent's capacity to balance conversion optimization with cost efficiency.

#### 5.1.3. Baselines

To evaluate the performance of our Guide, we compare it against a range of baseline approaches, including both reinforcement learning-based and generative model-based methods.
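Returning briefly to the bidding score of Section 5.1.2, Eqs. (16) and (17) can be sketched in a few lines. The sketch assumes that $x_i$ is a 0/1 win indicator for impression $i$, that $v_i$ is its conversion value, and that the realized CPA is total cost divided by total acquired value; these readings, along with the function names, are assumptions for illustration and are not spelled out in this section.

```python
from typing import Sequence


def cpa_penalty(cpa: float, cpa_constraint: float, beta: float = 2.0) -> float:
    """Eq. (17): min((C / CPA)^beta, 1). No penalty while CPA <= C."""
    if cpa <= cpa_constraint:
        return 1.0
    return (cpa_constraint / cpa) ** beta


def bidding_score(wins: Sequence[int], values: Sequence[float],
                  total_cost: float, cpa_constraint: float,
                  beta: float = 2.0) -> float:
    """Eq. (16): penalty * sum_i x_i * v_i.

    wins[i] is assumed to be 1 if impression i was won, else 0.
    CPA is assumed to be total_cost divided by the acquired value.
    """
    acquired = sum(x * v for x, v in zip(wins, values))
    if acquired == 0:
        return 0.0  # nothing acquired, so the score is zero regardless of cost
    cpa = total_cost / acquired
    return cpa_penalty(cpa, cpa_constraint, beta) * acquired
```

For example, with acquired value 1.0, a cost of 20, and a constraint of 10, the realized CPA is twice the constraint, so with $\beta = 2$ the score is scaled by $(1/2)^2 = 0.25$.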
For offline reinforcement learning, we consider BC (Bain and Sammut, [1995](https://arxiv.org/html/2605.19457#bib.bib26)), IQL (Kostrikov et al., [2021](https://arxiv.org/html/2605.19457#bib.bib8)), CQL (Kumar et al., [2020](https://arxiv.org/html/2605.19457#bib.bib9)), and TD3-BC (Fujimoto and Gu, [2021](https://arxiv.org/html/2605.19457#bib.bib27)), where all RL models are implemented following the AuctionNet (Su et al., [2024](https://arxiv.org/html/2605.19457#bib.bib22)) settings. Among generative approaches, we include AIGB (Guo et al., [2024](https://arxiv.org/html/2605.19457#bib.bib13)), a method built upon Decision Diffusion, as well as DT (Chen et al., [2021](https://arxiv.org/html/2605.19457#bib.bib12)) and its variants, namely GAS (Li et al., [2025](https://arxiv.org/html/2605.19457#bib.bib15)) and GAVE (Gao et al., [2025](https://arxiv.org/html/2605.19457#bib.bib16)), all of which are implemented using their released official code. For fairness, the settings of Decision Transformer are kept identical across these generative baselines.

### 5.2. RQ1: Offline Dataset and Simulation Environment Evaluation

As shown in Tables [1](https://arxiv.org/html/2605.19457#S5.T1) and [2](https://arxiv.org/html/2605.19457#S5.T2), we evaluate the performance of different methods on offline datasets under various budget levels, as well as using a simulated environment. Our key findings are as follows:

- Guide outperforms all baselines across all budget levels in offline dataset testing, and also achieves superior performance in simulation environments, demonstrating its strong advantages and adaptability across diverse testing scenarios. By employing unified modeling and trajectory exploration, Guide gains a deeper understanding of the advertising bidding environment, enabling it to maintain high performance across varying conditions.
- Generative models with exploration capabilities, such as GAS and GAVE, outperform RL-based models and the basic DT model.
This advantage is primarily attributed to their ability to effectively leverage historical bidding information while performing exploration, thereby mitigating the reliance on MDP modeling and overcoming limitations imposed by the dataset behavior policy.
- AIGB performed poorly, yielding the worst results in both offline datasets and simulation environments. This may be due to Decision Diffusion's difficulty in effectively learning reasonable policies under conditions of long sequences and sparse rewards. More efficient diffusion architectures combined with appropriate exploration strategies might be necessary for improvement.

Table 1. Offline evaluation on the AuctionNet under different budgets. Bold indicates the best results, underline denotes the second-best results.

Table 2. Performance in the Simulation Environment

### 5.3. RQ2: Model Analysis

#### 5.3.1. Ablation Study

As shown in Figure [3](https://arxiv.org/html/2605.19457#S5.F3), to investigate the effectiveness of each proposed component, we conducted ablation studies by evaluating the following modified versions:

- w/o IDM Action: Keep the model structure unchanged, and use only actions from DT.
- w/o DT Action: Keep the model structure unchanged, and use only actions from IDM.
- w/o Q Optimization: Only remove Q-value regularization optimization.
- w/o Q Selection: Retain Q-value regularization optimization, but select actions randomly.
- w/o action modeling: Remove the DT action loss, and use only actions from state modeling.
- Original DT: Follow the settings of the original DT paper (w/o state modeling), and remove all optimizations.

According to Figure [3](https://arxiv.org/html/2605.19457#S5.F3), we can draw the following findings: First, omitting actions from the DT or IDM in Guide leads to performance degradation, confirming that coupling them together contributes positively and underscoring the necessity of unified modeling.
Figure 3. Ablation Study

Second, random selection instead of Q-value-based selection also reduces performance, with results lying between those of using the two action sources separately. This can be attributed to DT actions being higher in overall quality than IDM actions. Third, removing Q-value regularization optimization causes a significant drop in performance, though the result still outperforms the original DT. This illustrates the effective role of Q-value regularization, while also highlighting the advantage of the unified modeling architecture compared to the original DT. Fourth, removing the action loss and using actions generated by state modeling yields a score only slightly higher than that of the original DT with action modeling, and significantly lower than Guide. This also underscores the importance of joint modeling.

#### 5.3.2. Two-stage Training Analysis

We conduct a detailed analysis of the proposed two-stage training strategy. As shown in Figure [4](https://arxiv.org/html/2605.19457#S5.F4), the blue curve represents joint training of the DT and IDM modules throughout the entire process, where the loss decreases slowly. The purple and green curves correspond to the two-stage training and fully separate training strategies, respectively. In the early stages of training, the purple and green curves are almost identical; in the later stages, after joint training begins, the purple curve's loss decreases at a slower rate, but still faster than the blue curve. Furthermore, it can be observed that the loss peaks of the purple curve are fewer and lower than those of the blue curve. From the perspective of offline testing scores, the two-stage training strategy also achieves the best performance. Overall, the two-stage training strategy demonstrates both stability and rapid convergence, offering a clear advantage and making it more suitable for practical applications and deployment.
Figure 4. Two-stage Training Analysis

### 5.4. RQ3: Co-operation between DT and IDM

As shown in Figure [5](https://arxiv.org/html/2605.19457#S5.F5), to investigate the impact of unified modeling on bidding actions, we have statistically analyzed the usage preferences of all 48 advertisers for actions from DT and IDM. Our key findings are as follows:

First, all advertisers utilize actions from both DT and IDM, with no instance of completely ignoring one source, indicating that both channels are effectively integrated within the unified modeling framework and contribute significantly to the bidding process.

Second, the majority of advertisers select the DT action more than 70% of the time, indicating that DT often produces superior actions. This phenomenon is consistent with our model design. That is, by performing explicit action optimization via the Q regularization term, the DT framework can effectively explore OOD trajectories and generally outperform IDM. When DT makes a risky OOD exploration, the model falls back to the more conservative actions generated by IDM as a safeguard. We thereby achieve a unified balance between exploration and safety.

Figure 5. Action preferences of different advertisers

Third, we conducted a detailed analysis of certain advertisers who show a preference for IDM to investigate the causes of this phenomenon. To do this, we classified advertisers based on budget and CPA constraint levels. Specifically, advertisers were grouped into three budget tiers: high budget (top 30%), medium budget (middle 40%), and low budget (bottom 30%). Similarly, CPA constraints were classified into high constraint (top 30%), medium constraint (middle 40%), and low constraint (bottom 30%). We conducted multiple experiments and averaged the results to reduce statistical noise. The analysis identified four advertisers (No. 24, No. 29, No. 31, and No. 38) who consistently exhibited a higher preference for IDM-generated actions.
All four belonged to one of the following extreme budget-constraint configurations:

- High budget combined with low constraint
- Low budget combined with high constraint

These findings suggest that an advertiser's budget and constraint settings can significantly influence the preference between DT and IDM strategies. In particular, we believe that such extreme cases of misalignment between budget and CPA constraints can lead to errors in DT during exploration, causing the model to prefer the more conservative IDM policy.

In addition, we analyzed the volatility characteristics of actions generated by DT and IDM to further explain IDM's role as a safeguard. Figure [6](https://arxiv.org/html/2605.19457#S5.F6) shows the mean, variance, and standard deviation of bid actions for both methods. These results quantitatively confirm that the overall volatility of bidding actions generated by IDM is smaller than that of actions generated by DT. This is consistent with DT's exploratory nature, as discussed in Section [4.4](https://arxiv.org/html/2605.19457#S4.SS4), and reinforces IDM's function as a more stable and conservative safeguard within the unified model.

Figure 6. Volatility comparison between DT and IDM bid actions

## 6. Online A/B Test

### 6.1. Deployment

To assess the real-world performance of Guide, we deployed it on Taobao, a major e-commerce platform in China. We use the DT model as the baseline for comparison with Guide. This setting involves advertisers specifying their budgets and optionally setting various constraints, such as CPA or Return on Investment (ROI). The bidding policy is responsible for optimizing traffic value while strictly adhering to these constraints.
State Representation: Each campaign's bidding environment is described by a 19-dimensional state vector comprising campaign-specific indicators (e.g., fraction of budget and bidding steps remaining, deviation from target CPA) and temporal statistics aggregated over recent time steps. Features include impressions, clicks, conversions, ad cost, GMV, and derived metrics like CTR and CVR, offering a detailed snapshot of the campaign's progress and recent market dynamics.

Action Mechanism: Rather than applying the model's action output directly, we stabilize bid adjustments by smoothing the action with a windowed average over the preceding time steps. Concretely, the final bid coefficient at each decision point is computed by blending the model's current suggestion with the average of previous coefficients over a trailing two-hour window. This technique helps mitigate abrupt fluctuations in bidding strategies, promoting more consistent campaign outcomes.

Reward and Return-to-Go Formulation: The reward signal is grounded in the advertiser's key business objective: maximizing gross merchandise value (GMV), subject to CPA and budget constraints. Specifically, the return-to-go is defined as the expected cumulative GMV from the current time frame through the remainder of the promotional day, taking into account all specified limits and the penalty term.

Evaluation Metrics: The effectiveness of the deployed Guide policy is assessed using multiple key business and platform-level metrics. These include:

- Ad Click: Number of clicks on ads, reflecting user engagement with advertised content.
- Ad Cost: Total advertising expenditure incurred via bidding, representing the financial outlay for acquiring traffic.
- Ad GMV: Gross Merchandise Volume generated from ad clicks, measuring the transaction value directly attributable to advertising.
- Ad ROI: Return on Investment, calculated as the ratio of Ad GMV to Ad Cost, indicating the efficiency and profitability of ad cost.

Training Phase. Our online system is trained on one week of historical advertising campaign data, where each trajectory represents a single day's sequence of observed states, decisions, and outcomes. The core model uses a Decision Transformer architecture with 6 stacked Transformer blocks, each featuring 8 attention heads and a hidden dimension of 512. The multilayer perceptrons in both the Inverse Dynamics Model and Q modules have a hidden dimension of 256.

Inference Phase. Deployed on a leading ad platform, Guide serves as the bidding agent for all advertised items, generating and updating bid decisions every 30 minutes across the entire item set to enable dynamic ad allocation aligned with real-time market conditions and user intent. A large-scale online A/B test covered approximately 160,000 distinct products and impacted tens of millions of dollars in gross merchandise value (exact figure withheld per company policy), consistently demonstrating significant gains in both efficiency and effectiveness, fully validating the system's robustness and business value.

### 6.2. Online A/B Test Results

#### 6.2.1. Overall Performance

Table [3](https://arxiv.org/html/2605.19457#S6.T3) shows significant improvements achieved by Guide on key advertising metrics. During an 8-day online A/B test, Ad Clicks increased by 1.40%, Ad Cost increased by 1.66%, Ad GMV increased by 4.10%, and Ad ROI increased by 3.52%. These performance gains fully demonstrate the effectiveness of Guide in real-world applications. It is particularly insightful to analyze the interplay between these metrics. The modest increase in Ad Cost (1.66%) accompanied by a substantially larger increase in Ad GMV (4.10%) is a strong indicator of improved spending efficiency.
This is not merely a case of bidding more aggressively to gain more traffic; rather, it demonstrates that Guide allocates the budget more intelligently. The resulting significant lift in Ad ROI (+3.52%) confirms that each dollar spent under Guide's control is generating higher returns. This demonstrates that the model's unified approach allows it to identify and secure higher-value ad impressions, specifically those with a greater likelihood of conversion, while steering clear of wasteful spending on less promising opportunities. As a result, the performance improvements stem not just from increased traffic volume, but more importantly from the enhanced quality and profitability of the acquired traffic, which is the ultimate objective of any advanced auto-bidding system.

Table 3. Improvements of GUIDE on Key Advertising Metrics

| Metric | Ad Click | Ad Cost | Ad GMV | Ad ROI |
| --- | --- | --- | --- | --- |
| Improvement | 1.40% | 1.66% | 4.10% | 3.52% |

#### 6.2.2. Trajectory optimization capability

A foundational principle in computational advertising is that advertising revenue is maximized when the expenditure trend dynamically aligns with the natural fluctuations in user traffic (Agarwal et al., [2014](https://arxiv.org/html/2605.19457#bib.bib28)). To evaluate Guide's ability to perform such dynamic budget allocation, we conducted an analysis of its cost trajectory control. We began by constructing an ideal ad cost trajectory in a post-hoc manner, which serves as an oracle benchmark representing the optimal spending pattern proportional to the observed traffic. Figure [7](https://arxiv.org/html/2605.19457#S6.F7) visually compares the actual cost trajectories managed by a baseline method and our proposed Guide against this ideal trajectory. As can be seen, while both methods attempt to follow the general trend, the cost distribution controlled by Guide (right panel) tracks the ideal cost much more closely across the different time steps.
The baseline method (left panel), in contrast, shows more significant deviations, particularly exhibiting over-expenditure during several peak hours. To quantify this alignment, we employed the Pearson correlation coefficient. The trajectory controlled by Guide achieved a correlation of 96.31% with the optimal trajectory, which is a clear improvement over the baseline's 93.73%. This result quantitatively confirms that Guide possesses superior trajectory optimization capabilities, enabling more precise and effective budget pacing over time.

Figure 7. Cost Trajectory Analysis

## 7. Conclusion and Limitations

We present Guide, a unified modeling and exploration approach for automatic ad bidding. Guide integrates three carefully designed components: the Decision Transformer (DT), the Inverse Dynamics Model (IDM), and the Q-value module. Their synergistic interaction enables a balance between exploration and safety, leading to significant improvements in bidding strategies. Experiments show that Guide consistently outperforms baseline methods across offline data, simulations, and real-world applications. In online tests, Guide increased Ad GMV by 4.10%, providing an effective solution for automatic bidding in complex advertising scenarios.

Despite Guide's strong performance in ad bidding, it lacks fine-grained mechanisms to handle abrupt traffic changes, limiting its responsiveness during sudden fluctuations or special events. Moreover, it primarily relies on offline data and current model architectures. Future work could integrate LLMs for trajectory control and dynamic optimization, enhancing robustness and adaptability in evolving marketplaces.

###### Acknowledgements.

This work was supported by Alibaba Group through Alibaba Innovative Research Program; and it was also supported by National Natural Science Foundation of China (No. 62272349).

## References

- D. Agarwal, S. Ghosh, K. Wei, and S.
You (2014) Budget pacing for targeted online advertisements at linkedin. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1613–1619.
- A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal (2022) Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657.
- M. Bain and C. Sammut (1995) A framework for behavioural cloning. In Machine intelligence 15, pp. 103–129.
- R. P. Borase, D. Maghade, S. Sondkar, and S. Pawar (2021) A review of pid control, tuning methods and applications. International Journal of Dynamics and Control 9(2), pp. 818–827.
- N. Borissov, D. Neumann, and C. Weinhardt (2010) Automated bidding in computational markets: an application in market-based allocation of computing services. Autonomous Agents and Multi-Agent Systems 21(2), pp. 115–142.
- C. Boutilier, T. Dean, and S. Hanks (1999) Decision-theoretic planning: structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11, pp. 1–94.
- H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo (2017) Real-time bidding by reinforcement learning in display advertising. In Proceedings of the tenth ACM international conference on web search and data mining, pp. 661–670.
- L. Cai, J. He, Y. Li, J. Liang, Y. Lin, Z. Quan, Y. Zeng, and J. Xu (2025) RTBAgent: a llm-based agent system for real-time bidding. In Companion Proceedings of the ACM on Web Conference 2025, pp. 104–113.
- L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. Advances in neural information processing systems 34, pp. 15084–15097.
- Y. Chen, P. Berkhin, B. Anderson, and N. R. Devanur (2011) Real-time bidding algorithms for performance-based display ad allocation. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1307–1315.
- G. B. Dantzig (2016) Linear programming and extensions.
- S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau (2019) Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708.
- S. Fujimoto and S. S. Gu (2021) A minimalist approach to offline reinforcement learning. Advances in neural information processing systems 34, pp. 20132–20145.
- J. Gao, Y. Li, S. Mao, P. Jiang, N. Jiang, Y. Wang, Q. Cai, F. Pan, P. Jiang, K. Gai, et al. (2025) Generative auto-bidding with value-guided explorations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 244–254.
- J. Guo, Y. Huo, Z. Zhang, T. Wang, C. Yu, J. Xu, B. Zheng, and Y. Zhang (2024) Generative auto-bidding via conditional diffusion modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5038–5049.
- Y. He, X. Chen, D. Wu, J. Pan, Q. Tan, C. Yu, J. Xu, and X. Zhu (2021) A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2993–3001.
- J. Ji, T. Wang, Y. Li, Y. Huo, Z. Zhang, C. Yu, J. Xu, and B. Zheng (2025) Bid2X: revealing dynamics of bidding environment in online advertising from a foundation model lens. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pp. 4543–4554.
- H. Jiang, Y. Tang, Y. Zeng, P. Yuan, Y. Cheng, T. Sha, X. Liu, and P. Jiang (2025) Optimal return-to-go guided decision transformer for auto-bidding in advertisement. In Companion Proceedings of the ACM on Web Conference 2025, pp. 1033–1037.
- J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang (2018) Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM international conference on information and knowledge management, pp. 2193–2201.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- C. Knospe (2006) PID control. IEEE Control Systems Magazine 26(1), pp. 30–31.
- I. Kostrikov, A. Nair, and S. Levine (2021) Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169.
- A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems 33, pp. 1179–1191.
- H. Li, Y. Huo, S. Dou, Z. Zheng, Z. Zhang, C. Yu, J. Xu, and F. Wu (2024) Trajectory-wise iterative reinforcement learning framework for auto-bidding. In Proceedings of the ACM Web Conference 2024, pp. 4193–4203.
- J. Li and P. Tang (2022) Auto-bidding equilibrium in roi-constrained online advertising markets. arXiv preprint arXiv:2210.06107.
- Y. Li, S. Mao, J. Gao, N. Jiang, Y. Xu, Q. Cai, F. Pan, P. Jiang, and B. An (2025) GAS: generative auto-bidding with post-training search. In Companion Proceedings of the ACM on Web Conference 2025, pp. 315–324.
- M. Liu, L. Jiaxing, Z. Hu, J. Liu, and X. Nie (2020) A dynamic bidding strategy based on model-free reinforcement learning in display advertising. IEEE Access 8, pp. 213587–213601.
- H. Lu, D. Han, Y. Shen, and D. Li (2025) What makes a good diffusion planner for decision making? arXiv preprint arXiv:2503.00535.
- Z. Mou, Y. Huo, R. Bai, M. Xie, C. Yu, J. Xu, and B. Zheng (2022) Sustainable online reinforcement learning for auto-bidding. Advances in Neural Information Processing Systems 35, pp. 2651–2663.
- Y. Peng, W. Shu, J. Sun, Y. Zeng, J. Pang, W. Bai, Y. Bai, X. Liu, and P. Jiang (2025) Expert-guided diffusion planner for auto-bidding. arXiv preprint arXiv:2508.08687.
- M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- K. Su, Y. Huo, Z. Zhang, S. Dou, C. Yu, J. Xu, Z. Lu, and B. Zheng (2024) Auctionnet: a novel benchmark for decision-making in large-scale games. Advances in Neural Information Processing Systems 37, pp. 94428–94452.
- C. Wen, M. Xu, Z. Zhang, Z. Zheng, Y. Wang, X. Liu, Y. Rong, D. Xie, X. Tan, C. Yu, et al. (2022) A cooperative-competitive multi-agent framework for auto-bidding in online advertising. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1129–1139.
- X. Yang, Y. Li, H. Wang, D. Wu, Q. Tan, J. Xu, and K. Gai (2019) Bid optimization by multivariable control in display advertising. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1966–1974.
- H. Yu, M. Neely, and X. Wei (2017) Online convex optimization with stochastic constraints. Advances in Neural Information Processing Systems 30.
- C. Yuan, M. Guo, C. Xiang, S. Wang, G. Song, and Q. Zhang (2022) An actor-critic reinforcement learning model for optimal bidding in online display advertising. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 3604–3613.
- S. Yuan, J. Wang, and X. Zhao (2013) Real-time bidding for online advertising: measurement and analysis. In Proceedings of the seventh international workshop on data mining for online advertising, pp. 1–8.
- H. Zhang, L. Niu, Z. Zheng, Z. Zhang, S. Gu, F. Wu, C. Yu, J. Xu, G. Chen, and B. Zheng (2023) A personalized automated bidding framework for fairness-aware online advertising. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5544–5553.
- W. Zhang, Y. Rong, J. Wang, T. Zhu, and X. Wang (2016) Feedback control of real-time display advertising. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 407–416.
- W. Zhang, S. Yuan, and J. Wang (2014) Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1077–1086.
- Z. Zhu, M. Liu, L. Mao, B. Kang, M. Xu, Y. Yu, S. Ermon, and W. Zhang (2024) Madiff: offline multi-agent learning with diffusion models. Advances in Neural Information Processing Systems 37, pp. 4177–4206.