A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions

arXiv cs.CL Papers

Summary

Introduces A3M, a framework combining adaptive deep reinforcement learning, adversarial reasoning, and multi-objective reward design for strategic bidding in repeated auctions, achieving 30-40% regret reduction.

arXiv:2606.28943v1 Announce Type: new


# A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions
Source: [https://arxiv.org/html/2606.28943](https://arxiv.org/html/2606.28943)
Junhan Li, Yuxin Zhang, Haoran Wang, Minghao Chen (all: Department of Computer Science, Nanjing University)

###### Abstract

Learning to bid in repeated multi-unit auctions with bandit feedback poses a fundamental challenge. Existing methods often rely on rigid explore-then-exploit schedules, assume stationary adversaries, and optimize solely for bidder utility, thereby limiting adaptability and strategic robustness. To address these limitations, we introduce the A3M framework, which integrates adaptive deep reinforcement learning (DRL), explicit adversarial reasoning, and principled multi-objective reward design for online auction strategy optimization. A3M employs an actor-critic DRL backbone to dynamically balance exploration and exploitation, an opponent model for fictitious play against non-stationary adversaries, and a composite reward function to jointly maximize utility, auctioneer revenue, and fairness. We provide the first comprehensive empirical evaluation of this integrated approach against established baselines in both discriminatory and uniform price auctions. Results show that A3M reduces final regret by 30–40% in standard settings, maintains robust performance against adversarial strategy shifts, scales favorably with the number of units $K$, and enables tunable multi-objective trade-offs. An extensive ablation study confirms the necessity of each core component. Our work establishes A3M as a powerful and flexible framework for learning in complex auction environments.

![Refer to caption](https://arxiv.org/html/2606.28943v1/figs/fig_motivation1.png)

Figure 1: Motivation of this work. Repeated auctions exhibit non-stationarity, strategic opponents, and multiple competing objectives, motivating a unified learning framework that is adaptive, adversarial-aware, and multi-objective.

## 1 Introduction

Uniform-price and discriminatory-price auctions are fundamental market mechanisms for allocating multiple identical items, widely used in domains such as electricity markets and treasury bill sales. Both allocate items to the highest bids, but they differ critically in how winning bidders pay, which has prompted extensive theoretical and empirical comparisons of their efficiency, revenue, and strategic properties. In repeated auctions, bidders can refine their strategies over time via online learning. This problem is known as *learning to bid*, and performance is typically measured by regret against the best fixed bid in hindsight. This regret framework offers a principled lens for quantifying and comparing the inherent difficulty of bidding optimally under different auction formats.

Prior work on learning in repeated multi-unit auctions has focused largely on settings with adversarial opposing bids. For both uniform and discriminatory auctions, these studies establish worst-case regret rates of $\tilde{\mathcal{O}}(\sqrt{T})$ under full-information feedback and $\tilde{\mathcal{O}}(T^{2/3})$ under bandit feedback, which are generally tight. Yet the adversarial setting conflates the intrinsic complexity of the auction mechanism with the strategic complexity of facing adaptive opponents. As a result, a systematic comparison of the two auction formats under a *stochastic* opponent model, which isolates the mechanism’s inherent learning difficulty, remains an open question.

This work fills this gap by conducting a comprehensive regret-based comparison of repeated uniform and discriminatory auctions against stochastic opposing bids. Our core methodological contribution is the A3M (Adaptive, Adversarial & Multi-objective) framework, which departs from earlier estimate-then-commit approaches. Inspired by recent advances in reinforcement learning Song et al. ([2025a](https://arxiv.org/html/2606.28943#bib.bib46)); Qi et al. ([2022](https://arxiv.org/html/2606.28943#bib.bib52)); Wu et al. ([2020](https://arxiv.org/html/2606.28943#bib.bib55)); Tian et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib60)), A3M integrates deep reinforcement learning for adaptive strategy optimization, explicit opponent modeling for adversarial reasoning, and a principled multi-objective reward design. This integrated architecture enables dynamic exploration–exploitation balancing and strategic adaptation beyond stationary i.i.d. assumptions.

Our analysis yields several key insights. First, we establish that both auction formats admit tight worst-case regret rates of $\tilde{\Theta}(\sqrt{T})$ and $\tilde{\Theta}(T^{2/3})$ under full-information and bandit feedback, respectively, in the stochastic setting. Notably, we provide the first matching lower bound of $\Omega(T^{2/3})$ for uniform-price auctions under bandit feedback, thereby completing the characterization of worst-case regret scaling. Second, beyond worst-case analysis, we identify and formalize families of instances where the two mechanisms exhibit a *separation* in achievable regret. For example, when opponents are symmetric and unit-demand, or when their bid distributions are well-separated, the uniform-price auction can achieve $\tilde{\mathcal{O}}(\sqrt{T})$ regret while the discriminatory auction remains at $\Omega(T^{2/3})$ Chen et al. ([2025a](https://arxiv.org/html/2606.28943#bib.bib7)); You et al. ([2026](https://arxiv.org/html/2606.28943#bib.bib6)); Chen et al. ([2025c](https://arxiv.org/html/2606.28943#bib.bib5)); Zhang et al. ([2026a](https://arxiv.org/html/2606.28943#bib.bib4)); Zhao et al. ([2026](https://arxiv.org/html/2606.28943#bib.bib3)); Huang et al. ([2026](https://arxiv.org/html/2606.28943#bib.bib2)); Chen et al. ([2025b](https://arxiv.org/html/2606.28943#bib.bib1)).

Empirically, we show that the proposed A3M framework sets a new state-of-the-art adaptive baseline that significantly outperforms prior approaches Qu and Ma ([2025](https://arxiv.org/html/2606.28943#bib.bib51)); Wu et al. ([2024a](https://arxiv.org/html/2606.28943#bib.bib58),[b](https://arxiv.org/html/2606.28943#bib.bib57)); Lin ([2025b](https://arxiv.org/html/2606.28943#bib.bib61),[a](https://arxiv.org/html/2606.28943#bib.bib62),[c](https://arxiv.org/html/2606.28943#bib.bib63)). It attains lower regret than conventional algorithms in standard stochastic settings, displays stronger robustness against non-stationary adversaries, scales more favorably with the number of units $K$, and effectively exploits easy instance structures (e.g., $\Delta$-separated distributions). Moreover, building upon these foundational works, A3M’s multi-objective reward design enables tunable trade-offs between bidder utility and auctioneer revenue that extend beyond existing baselines. An extensive ablation study confirms the critical role of each core module (adaptive learning, adversarial reasoning, and multi-objective design) in the framework’s overall performance Zhang et al. ([2025b](https://arxiv.org/html/2606.28943#bib.bib8),[e](https://arxiv.org/html/2606.28943#bib.bib9),[c](https://arxiv.org/html/2606.28943#bib.bib10),[d](https://arxiv.org/html/2606.28943#bib.bib11),[a](https://arxiv.org/html/2606.28943#bib.bib12)); Mo et al. ([2026](https://arxiv.org/html/2606.28943#bib.bib13)); Yu et al. ([2026](https://arxiv.org/html/2606.28943#bib.bib14)); Zhang et al. ([2026b](https://arxiv.org/html/2606.28943#bib.bib15)).

The remainder of the paper is organized as follows. Section [2](https://arxiv.org/html/2606.28943#S2) reviews related work. Section [3](https://arxiv.org/html/2606.28943#S3) presents the A3M framework in detail. Section [4](https://arxiv.org/html/2606.28943#S4) analyzes learning with bandit feedback, including theoretical results and the A3M baseline. Section [5](https://arxiv.org/html/2606.28943#S5) extends the analysis beyond worst-case settings. Section [6](https://arxiv.org/html/2606.28943#S6) provides comprehensive ablation studies. Section [7](https://arxiv.org/html/2606.28943#S7) presents additional empirical evaluations. We conclude in Section [8](https://arxiv.org/html/2606.28943#S8).

## 2 Related Work

We review the literature on the simultaneous auction of multiple identical items, a standard model encompassing formats such as uniform-price, discriminatory-price, and Vickrey-Clarke-Groves (VCG) auctions Vickrey ([1961](https://arxiv.org/html/2606.28943#bib.bib16)); Clarke ([1971](https://arxiv.org/html/2606.28943#bib.bib17)); Groves ([1973](https://arxiv.org/html/2606.28943#bib.bib18)); Yang et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib64)); He et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib65)); Zhou et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib66)). Empirical and theoretical comparisons of these mechanisms have primarily focused on their revenue generation Ausubel et al. ([2014](https://arxiv.org/html/2606.28943#bib.bib19)); Brenner et al. ([2009](https://arxiv.org/html/2606.28943#bib.bib22)); Nyborg and Sundaresan ([1996](https://arxiv.org/html/2606.28943#bib.bib41)); Cao et al. ([2025a](https://arxiv.org/html/2606.28943#bib.bib68),[b](https://arxiv.org/html/2606.28943#bib.bib69)); Xin et al. ([2025a](https://arxiv.org/html/2606.28943#bib.bib70)) and social welfare performance Krishna ([2009](https://arxiv.org/html/2606.28943#bib.bib23)); Milgrom ([2004](https://arxiv.org/html/2606.28943#bib.bib24)); Xin et al. ([2025b](https://arxiv.org/html/2606.28943#bib.bib71),[2024](https://arxiv.org/html/2606.28943#bib.bib72)); Yu ([2025](https://arxiv.org/html/2606.28943#bib.bib73)), with significant attention given to symmetric and unit-demand settings Milgrom and Weber ([1982](https://arxiv.org/html/2606.28943#bib.bib20)); Xiang et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib74)); Wang et al. ([2013](https://arxiv.org/html/2606.28943#bib.bib75)); Bai et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib76)).

The study of repeated auctions has increasingly utilized online learning tools to analyze dynamic strategic behavior Weed et al. ([2016](https://arxiv.org/html/2606.28943#bib.bib25)); Balseiro and Gur ([2019](https://arxiv.org/html/2606.28943#bib.bib26)); Wei et al. ([2025a](https://arxiv.org/html/2606.28943#bib.bib77)); Mu-Jiang-shan et al. ([2010](https://arxiv.org/html/2606.28943#bib.bib78)); Wang et al. ([2011](https://arxiv.org/html/2606.28943#bib.bib79)). Early work in this line focused on the auctioneer’s problem of learning optimal reserve prices Kanoria and Nazerzadeh ([2021](https://arxiv.org/html/2606.28943#bib.bib27)); Pan et al. ([2024](https://arxiv.org/html/2606.28943#bib.bib80)); Wang et al. ([2012](https://arxiv.org/html/2606.28943#bib.bib81),[2025](https://arxiv.org/html/2606.28943#bib.bib82)). Subsequent research shifted to the bidder’s perspective, introducing and analyzing the problem of *learning to bid* in various repeated single-item auction formats Han et al. ([2024](https://arxiv.org/html/2606.28943#bib.bib28)); Feng et al. ([2018](https://arxiv.org/html/2606.28943#bib.bib29)); Amin et al. ([2013](https://arxiv.org/html/2606.28943#bib.bib30)); Yan et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib83)); Niu et al. ([2024a](https://arxiv.org/html/2606.28943#bib.bib84)); Wang et al. ([2024b](https://arxiv.org/html/2606.28943#bib.bib85)). Building upon these foundations, recent work has extended online learning to federated and privacy-preserving settings Wu et al. ([2022](https://arxiv.org/html/2606.28943#bib.bib53),[2024c](https://arxiv.org/html/2606.28943#bib.bib54)); Wang et al. ([2023](https://arxiv.org/html/2606.28943#bib.bib56)); Zhang et al. ([2025f](https://arxiv.org/html/2606.28943#bib.bib86)); Niu et al. ([2024b](https://arxiv.org/html/2606.28943#bib.bib87)).

The problem of learning to bid in repeated *multi-unit* auctions is a more recent development. Initial results were provided for uniform-price auctions by Golrezaei et al. ([2021](https://arxiv.org/html/2606.28943#bib.bib31)); Yu et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib88)); Bi et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib89)). For discriminatory auctions, Badanidiyuru et al. ([2021](https://arxiv.org/html/2606.28943#bib.bib32)); Xu et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib90)); Han et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib91)) established regret rates of $\tilde{\mathcal{O}}(K\sqrt{T})$ and $\tilde{\mathcal{O}}(KT^{2/3})$ under full-information and bandit feedback, respectively. Extending these baselines with modern deep learning techniques, recent advances in transformer architectures and learning technologies Song et al. ([2025b](https://arxiv.org/html/2606.28943#bib.bib47)); Wu et al. ([2024b](https://arxiv.org/html/2606.28943#bib.bib57)); Wei et al. ([2025b](https://arxiv.org/html/2606.28943#bib.bib92)); You et al. ([2025](https://arxiv.org/html/2606.28943#bib.bib93)) have opened new avenues for sequence modeling in strategic settings that achieve better sample complexity and adaptive performance. In particular, patch-based Transformer encoders with explicit channel–time attention have been shown to improve long-horizon modeling by capturing both temporal dependencies and inter-variable (inter-channel) correlations in multivariate sequences. In the full-information setting for uniform-price auctions, recent work Balseiro et al. ([2017](https://arxiv.org/html/2606.28943#bib.bib43)); Wang ([2025](https://arxiv.org/html/2606.28943#bib.bib94),[2024](https://arxiv.org/html/2606.28943#bib.bib95)) derived a $\tilde{\mathcal{O}}(K^{3/2}\sqrt{T})$ regret bound, alongside sub-optimal bandit results. For the bandit setting under a Last Accepted Bid (LAB) pricing rule, Ai et al. ([2022](https://arxiv.org/html/2606.28943#bib.bib44)); Wang and Sayil ([2024](https://arxiv.org/html/2606.28943#bib.bib97)); Deng ([2025](https://arxiv.org/html/2606.28943#bib.bib98)) proposed an algorithm achieving $\tilde{\mathcal{O}}(K^{4/3}T^{2/3})$ regret and proved this rate to be tight for the LAB rule; however, a matching lower bound for the standard First Rejected Bid (FRB) rule remains an open question. In a related setting with return-on-investment constraints, prior work obtained a regret bound of $\tilde{\mathcal{O}}(K^{5/3}T^{2/3})$ for uniform-price auctions.

![Refer to caption](https://arxiv.org/html/2606.28943v1/figs/fig_overview1.png)

Figure 2: Overview of the A3M algorithm architecture. The framework encodes auction states, integrates adaptive learning, adversarial reasoning, and multi-objective optimization, and iteratively updates the bidding policy via online feedback.

## 3 The A3M Framework

We propose the A3M framework, which departs from the conventional estimate-then-optimize approach for repeated auctions. A3M integrates deep reinforcement learning (DRL) for adaptive online strategy optimization, adversarial reasoning via population game dynamics, and a principled multi-objective reward design aligned with broader mechanism design goals. This section details its core components.

### 3.1 Model Reformulation and Structured Policy Representation

We reformulate the learner’s problem as follows. At each round $t$, the learner observes a state $s_t \in \mathcal{S}$ encoding relevant history. Specifically, $s_t = (h_t, \xi_t)$, where $h_t$ is a compressed representation of past bids, allocations, and payments (e.g., via an LSTM or moving averages), and $\xi_t$ denotes exogenous context (e.g., a market index; included for generality but unused in the baseline).

Instead of directly outputting a $K$-dimensional vector $\mathbf{b}_t$, the policy $\pi_\theta$ parameterizes a bidding function $\phi(\cdot;\psi_t): [K] \rightarrow [0,1]$. For item $k$, the bid is $b^t_k = \phi(k;\psi_t)$. The parameters $\psi_t \in \mathbb{R}^d$ are generated by a neural network (the actor) conditioned on the state: $\psi_t = f_\theta^{actor}(s_t)$, abbreviated $f_\theta$ below. This structured representation enforces inductive biases such as smoothness in $k$ and enhances interpretability, as $\phi$ can be visualized. The actor and bidding formulations are given by:

$$\text{Actor:}\quad \psi_t = f_\theta(s_t), \tag{1}$$

$$\text{Bidding:}\quad b^t_k = \phi(k;\psi_t), \quad \forall k \in [K]. \tag{2}$$

Common choices for $\phi$ include monotone neural networks or piecewise-linear functions with non-increasing knot values, which satisfy the non-increasing bid constraint ($b^t_k \geq b^t_{k+1}$) by construction.
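To make the structured representation concrete, here is a minimal sketch of one possible piecewise-linear bidding function. The knot parameterization (`sigmoid` heights shrunk multiplicatively) and the names `bids_from_params` and `psi` are illustrative assumptions, not the construction used in the paper; the only property taken from the text is that bids lie in $[0,1]$ and are non-increasing in $k$ by construction.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bids_from_params(psi: list[float], K: int) -> list[float]:
    """Map unconstrained parameters psi (length d >= 2) to K non-increasing bids in [0, 1].

    Hypothetical parameterization: psi[0] sets the first knot height and every
    later entry shrinks the previous knot by a factor in (0, 1), so the knots
    are non-increasing by construction.
    """
    knots = [sigmoid(psi[0])]
    for p in psi[1:]:
        knots.append(knots[-1] * sigmoid(p))
    d = len(knots)
    bids = []
    for k in range(K):
        # Position of item k on the knot axis, spread evenly over [0, d - 1].
        pos = k * (d - 1) / (K - 1) if K > 1 else 0.0
        lo = min(int(pos), d - 2)
        frac = pos - lo
        # Linear interpolation between two neighbouring knots.
        bids.append((1.0 - frac) * knots[lo] + frac * knots[lo + 1])
    return bids


bids = bids_from_params([0.5, -1.0, 2.0], K=5)
print([round(b, 3) for b in bids])
```

Because only $d$ numbers are learned regardless of $K$, a parameterization of this kind keeps the actor's output dimension fixed as the number of units grows.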

### 3.2 Adversarial Reasoning via Opponent Modeling

To reason about the adversary’s evolving behavior, A3M maintains an explicit opponent model. The repeated auction is conceptualized as a game between the learner (policy $\pi_\theta$) and a population of opponents, with aggregate behavior characterized by a distribution $P_\phi(\boldsymbol{\beta} \mid z_t)$. Here, $z_t \in \mathcal{Z}$ is a latent state representing the current “population strategy,” and $\phi$ denotes the learnable parameters of the opponent model (distinct from the bidding function $\phi(\cdot;\psi_t)$ of Section 3.1, despite the shared symbol).

After observing the adversary’s bid vector $\boldsymbol{\beta}^t$ (full-information) or partial feedback $\omega_t$ (bandit), we update the latent state:

$$z_{t+1} = g_\phi(z_t, \boldsymbol{\beta}^t \text{ or } \omega_t, \mathbf{b}^t). \tag{3}$$

The function $g_\phi$ can be a recurrent network. In the bandit setting, $\omega_t$ is the observed allocation and price vector, and an inference network estimates a posterior over $\boldsymbol{\beta}^t$.
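As a deliberately simple stand-in for the recurrent update in Eq. (3), the sketch below uses an exponential moving average of the inferred opposing bids. This is an assumption made for illustration: it is not the paper's $g_\phi$, it ignores the learner's own bid $\mathbf{b}^t$, and the names `update_latent` and `alpha` are hypothetical.

```python
def update_latent(z, beta_hat, alpha=0.2):
    """One recurrent step z_{t+1} = (1 - alpha) * z_t + alpha * beta_hat_t.

    z: current latent state (here simply a running estimate of opposing bids).
    beta_hat: inferred opposing bid vector for the current round.
    alpha: hypothetical step size controlling how fast old rounds are forgotten.
    """
    return [(1.0 - alpha) * zi + alpha * bi for zi, bi in zip(z, beta_hat)]


# The state drifts toward the new bid profile after an opponent strategy shift.
z = [0.5, 0.5]
for beta_hat in ([0.9, 0.1], [0.9, 0.1], [0.9, 0.1]):
    z = update_latent(z, beta_hat)
print(z)
```

A larger `alpha` tracks strategy shifts faster at the cost of noisier estimates, which is the same trade-off a learned recurrent update has to resolve.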

The learner’s policy is optimized as a best response to the current estimated opponent model, not merely to the historical average. The objective becomes:

$$\max_{\theta}\; \mathbb{E}_{\boldsymbol{\beta}\sim P_\phi(\cdot \mid z_t),\, \mathbf{b}\sim\pi_\theta(\cdot \mid s_t)}\left[R(\mathbf{b},\boldsymbol{\beta};\lambda)\right], \tag{4}$$

where $R$ is the multi-objective reward defined below. This induces a form of fictitious play Brown ([1951](https://arxiv.org/html/2606.28943#bib.bib37)); Heinrich et al. ([2015](https://arxiv.org/html/2606.28943#bib.bib38)), which aims to drive the system toward a Nash equilibrium in policy space.

### 3.3 Multi-Objective Reward Design

A3M’s core innovation replaces single\-dimensional utility with a composite reward signal reflecting mechanism design desiderata\. The per\-round reward is defined as:

$$R(\mathbf{b},\boldsymbol{\beta};\lambda) := \underbrace{\lambda_u \cdot u(\mathbf{b},\boldsymbol{\beta})}_{\text{Efficiency (Bidder Utility)}} \;+\; \underbrace{\lambda_r \cdot \text{Rev}(\mathbf{b},\boldsymbol{\beta})}_{\text{Auctioneer Revenue}} \;-\; \underbrace{\lambda_f \cdot \mathcal{L}_{\text{fair}}(\mathbf{b},\boldsymbol{\beta})}_{\text{Fairness Penalty}}. \tag{5}$$

The components are as follows. The **Bidder Utility** is $u(\mathbf{b},\boldsymbol{\beta}) = \sum_{l=1}^{x(\mathbf{b},\boldsymbol{\beta})}[v_l - p(\mathbf{b},\boldsymbol{\beta})(l)]$, as originally defined. The **Auctioneer Revenue** is $\text{Rev}(\mathbf{b},\boldsymbol{\beta}) = \sum_{l=1}^{K} p(\mathbf{b},\boldsymbol{\beta})(l)\cdot\mathbb{I}\{\text{item } l \text{ is won by someone}\}$. From a learner-centric view, this approximates revenue from the learner’s payments, but the model can be extended to estimate total revenue. The **Fairness Penalty** promotes desirable mechanism properties. For discriminatory auctions, a natural penalty is the variance of paid prices among won items: $\mathcal{L}_{\text{fair}}^{disc} = \text{Var}(\{p(\mathbf{b},\boldsymbol{\beta})(l) : l \leq x(\mathbf{b},\boldsymbol{\beta})\})$. For uniform auctions, a penalty encourages *ex-post* incentive compatibility by penalizing bids far from true marginal value: $\mathcal{L}_{\text{fair}}^{uni} = \sum_{l=1}^{K}(b_l - v_l)^2\cdot\mathbb{I}\{b_l \text{ is pivotal}\}$. The hyperparameter vector $\lambda = (\lambda_u,\lambda_r,\lambda_f)$ controls the trade-off.
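The composite reward in Eq. (5) can be sketched for the discriminatory case as follows. Several choices here are assumptions rather than details from the paper: revenue uses the learner-centric reading (the sum of the learner's own payments), the variance is the population variance, the penalty is taken to be zero when nothing is won, and the function name `composite_reward` is hypothetical.

```python
def composite_reward(values, prices, x, lam_u, lam_r, lam_f):
    """Per-round reward R = lam_u * u + lam_r * Rev - lam_f * L_fair.

    values[l], prices[l]: marginal value and price paid for the (l+1)-th unit.
    x: number of units won by the learner in this round.
    """
    won_prices = prices[:x]
    # Bidder utility: value minus price, summed over the units actually won.
    utility = sum(v - p for v, p in zip(values[:x], won_prices))
    # Learner-centric revenue: what the learner pays the auctioneer.
    revenue = sum(won_prices)
    if x == 0:
        fairness = 0.0  # assumed convention: no won items, no penalty
    else:
        mean = revenue / x
        # Population variance of the prices paid for won items.
        fairness = sum((p - mean) ** 2 for p in won_prices) / x
    return lam_u * utility + lam_r * revenue - lam_f * fairness


r = composite_reward([0.9, 0.8, 0.7], [0.6, 0.5, 0.4], x=2,
                     lam_u=1.0, lam_r=0.5, lam_f=2.0)
print(round(r, 4))
```

Setting `lam_r = lam_f = 0` recovers the pure bidder-utility objective, so a utility-only learner is a special case of the same reward.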

The learner’s goal is to maximize the expected cumulative discounted reward $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T-1}\gamma^{t}R_t\right]$, where $\gamma \in [0,1)$ is a discount factor.

### 3.4 Adaptive Learning via Actor-Critic Reinforcement Learning

To maximize $J(\theta)$ under partial feedback in an adversarial environment, A3M employs an Actor-Critic DRL algorithm Konda and Tsitsiklis ([2000](https://arxiv.org/html/2606.28943#bib.bib39)); Mnih et al. ([2016](https://arxiv.org/html/2606.28943#bib.bib40)), unifying the structured policy, opponent model, and multi-objective reward. Modern GPU acceleration techniques Li et al. ([2024](https://arxiv.org/html/2606.28943#bib.bib48)) enable efficient training of these neural network policies at scale. In practice, the learning signal in repeated auctions is often heteroskedastic due to regime shifts and volatility in observed prices and allocations; we therefore stabilize optimization by normalizing trajectory windows and, when constructing minibatches, using volatility-aware sampling to reduce domination by extreme segments.

A second neural network, the critic $V_\xi(s)$ parameterized by $\xi$, estimates the state-value function $V^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{k \geq 0}\gamma^{k}R_{t+k} \mid s_t = s\right]$. It evaluates the quality of states $s$ under the current policy $\pi_\theta$ and opponent model.

The actor parameters $\theta$ are updated using the advantage function $A_t = R_t + \gamma V_\xi(s_{t+1}) - V_\xi(s_t)$, measuring how much better action $\mathbf{b}_t$ is than average at state $s_t$. The policy gradient is estimated as:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_t\left[\nabla_\theta \log\pi_\theta(\mathbf{b}_t \mid s_t)\cdot A_t\right]. \tag{6}$$

This update increases the probability of actions yielding higher-than-expected composite reward. Exploration arises naturally from the policy’s stochasticity.
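The quantities in this subsection can be computed on a toy batch as follows. The sketch assumes a one-dimensional Gaussian policy $\mathcal{N}(\mu, \sigma^2)$ over a scalar action purely to give $\nabla \log \pi$ a closed form; the paper's policy is a neural network, and all function names here are hypothetical.

```python
def td_advantages(rewards, values, next_values, gamma):
    """A_t = R_t + gamma * V(s_{t+1}) - V(s_t) for each stored transition."""
    return [r + gamma * vn - v for r, v, vn in zip(rewards, values, next_values)]


def critic_loss(advantages):
    """Mean squared TD error, the quantity the critic update minimizes."""
    return sum(a * a for a in advantages) / len(advantages)


def gaussian_policy_gradient(actions, mu, sigma, advantages):
    """Monte Carlo estimate of Eq. (6) with respect to the mean of a Gaussian policy.

    For pi = N(mu, sigma^2), d/dmu log pi(b) = (b - mu) / sigma^2.
    """
    scores = [(b - mu) / sigma ** 2 for b in actions]
    return sum(s * a for s, a in zip(scores, advantages)) / len(actions)


adv = td_advantages(rewards=[1.0, 0.0], values=[0.5, 0.2],
                    next_values=[0.2, 0.0], gamma=0.9)
grad = gaussian_policy_gradient(actions=[0.6, 0.4], mu=0.5, sigma=0.1,
                                advantages=adv)
print(adv, critic_loss(adv), grad)
```

In this toy batch the action above the mean has a positive advantage and the action below it has a negative one, so the estimated gradient pushes the mean upward.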

The interplay of these components is summarized in Algorithm [1](https://arxiv.org/html/2606.28943#alg1).

**Algorithm 1** The A3M Algorithm (Bandit Feedback Setting)

**Input:** Time horizon $T$, discount factor $\gamma$, reward weights $\lambda$, actor network $f_\theta$, critic network $V_\xi$, opponent model $g_\phi$.

1. Initialize parameters $\theta, \xi, \phi$, replay buffer $\mathcal{B}$, latent state $z_1$.
2. **for** $t = 1$ **to** $T$ **do**
3. Observe state $s_t$ (computed from history $h_t$).
4. **Actor:** Generate bidding parameters $\psi_t = f_\theta(s_t)$, form bids $\mathbf{b}_t$ via Eq. [2](https://arxiv.org/html/2606.28943#S3.E2).
5. Submit bid $\mathbf{b}_t$, observe allocation $x_t$ and price vector $\mathbf{p}_t$.
6. **Reward:** Compute composite reward $R_t$ using Eq. [5](https://arxiv.org/html/2606.28943#S3.E5). $\triangleright$ Multi-Objective Core
7. **Infer & Update Opponent Model:** Estimate posterior $\hat{\boldsymbol{\beta}}_t$ from $(x_t, \mathbf{p}_t, \mathbf{b}_t)$. Update $z_{t+1} = g_\phi(z_t, \hat{\boldsymbol{\beta}}_t, \mathbf{b}_t)$. $\triangleright$ Adversarial Reasoning
8. Observe new state $s_{t+1}$.
9. Store transition $(s_t, \mathbf{b}_t, R_t, s_{t+1})$ in $\mathcal{B}$.
10. **if** it is time to update **then**
11. Sample batch from $\mathcal{B}$.
12. **Critic Update:** Minimize $\|R_t + \gamma V_\xi(s_{t+1}) - V_\xi(s_t)\|^2$ w.r.t. $\xi$.
13. **Actor Update:** Update $\theta$ using the gradient from Eq. [6](https://arxiv.org/html/2606.28943#S3.E6) with computed advantages.
14. (Optional) Update opponent model parameters $\phi$ via maximum likelihood on inferred bids.
15. **end if**
16. **end for**

### 3.5 Theoretical Intuition and Advantages

The A3M framework addresses the prior limitations as follows. First, regarding *dynamic adaptation*, the RL objective $J(\theta)$ and Actor-Critic architecture balance exploration and exploitation throughout the horizon, eliminating the need for a predefined $T_{expl}$. The policy adapts based on the continuously learned value function. Second, for *strategic robustness*, the explicit opponent model ($P_\phi, z_t$) enables the learner to anticipate and adapt to non-stationary or strategic adversary behavior, moving beyond i.i.d. assumptions. Third, concerning *mechanism-aware optimization*, by optimizing $R_t(\lambda)$, the learner directly internalizes the auction designer’s multi-faceted goals. Tuning $\lambda$ allows studying Pareto-optimal trade-offs between efficiency, revenue, and fairness in learned strategies. Finally, with respect to *scalability and generalization*, the neural network parameterization handles large $K$ more gracefully than direct vector search over $B \subset [0,1]^{K}$. The learned bidding function $\phi$ can generalize to unseen valuation profiles or market conditions.

While a full frequentist regret analysis of the composite objective is beyond this initial presentation, the framework builds on convergence guarantees of policy gradient methods for MDPs and fictitious play in games. The structured policy representation also enhances interpretability Hsieh et al. ([2024](https://arxiv.org/html/2606.28943#bib.bib50)), allowing practitioners to visualize and understand learned bidding strategies. Empirical validation demonstrates superior performance and flexibility over the estimate-then-commit baseline.

## 4 Learning with Bandit Feedback

We now focus on learning in the bandit feedback setting. Recall that at each time $t$, the bidder observes only his allocation $x(\mathbf{b}^t,\boldsymbol{\beta}^t)$ and the price paid per unit $p(\mathbf{b}^t,\boldsymbol{\beta}^t)$. Since the allocation function is identical across auction types, the observed allocation conveys the same information:

$$\left(\mathbb{1}\left\{b_i^t \geq \beta_{K-i+1}^t\right\}\right)_{i\in[K]}.$$

However, the information conveyed by the price differs significantly. In the discriminatory auction, the price depends solely on the bidder’s own bid, providing no additional information about the opponents. In the uniform auction, the price can be set by an opposing bid $\beta_{K-i+1}$, potentially revealing information about the distribution $\mathcal{D}$. The following lemma formalizes the feedback received in the uniform price auction.

###### Lemma 1\.

At time $t\in[T]$, let $\mathbf{b}^t, \boldsymbol{\beta}^t \in B$ be the bids of the learner and the adversary. Under bandit feedback in the uniform auction, the bidder observes $\left(\mathbb{1}\left\{\beta_{K-i+1}^t \in (b_{i+1}^t, b_i^t]\right\}\beta_{K-i+1}^t\right)_{i\in[K]}$.

This feedback reveals specific components of $\boldsymbol{\beta}^t$ when they fall within certain intervals, offering richer information than in the discriminatory case. We next examine the achievable regret rates in both auction formats and the impact of this feedback discrepancy.
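The two feedback vectors above can be computed directly from their definitions. The sketch assumes both bid vectors are sorted in non-increasing order and adopts the convention $b_{K+1} := 0$ for the last interval, which the text does not state; the function names are hypothetical.

```python
def allocation_indicators(b, beta):
    """Return [1{b_i >= beta_{K-i+1}}] for i = 1..K (lists are 0-indexed).

    b, beta: length-K bid vectors of the learner and the adversary,
    each sorted in non-increasing order.
    """
    K = len(b)
    return [1 if b[i] >= beta[K - 1 - i] else 0 for i in range(K)]


def uniform_price_feedback(b, beta):
    """Return [1{beta_{K-i+1} in (b_{i+1}, b_i]} * beta_{K-i+1}] as in Lemma 1.

    Assumed convention: b_{K+1} = 0, so the last interval is (0, b_K].
    """
    K = len(b)
    b_ext = list(b) + [0.0]
    feedback = []
    for i in range(K):
        opp = beta[K - 1 - i]
        # The opposing bid is revealed only if it lands between two adjacent own bids.
        feedback.append(opp if b_ext[i + 1] < opp <= b_ext[i] else 0.0)
    return feedback


b = [0.8, 0.5, 0.2]
beta = [0.9, 0.45, 0.3]
print(allocation_indicators(b, beta))   # [1, 1, 0]
print(uniform_price_feedback(b, beta))  # [0.0, 0.45, 0.0]
```

Here the learner wins two units, and the opposing bid 0.45 is revealed because it falls in $(b_3, b_2] = (0.2, 0.5]$; a discriminatory auction would return the same allocation but no opposing bid.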

### 4.1 Discriminatory Price Auction

Learning in the discriminatory price auction with fixed valuation has been studied in prior work Han et al. ([2024](https://arxiv.org/html/2606.28943#bib.bib28)), which provided algorithms with regret upper bounds of $\mathcal{O}\left(KT^{2/3}\right)$ alongside a lower bound of $\Omega\left(K^{2/3}T^{2/3}\right)$ for discretized bid strategies. We strengthen this result with a lower bound applicable to *any* algorithm, including those operating over continuous bid spaces.

###### Lemma 2\.

Any algorithm for bidding in repeated first-price auctions with known valuation must incur a regret of $\Omega\left(T^{2/3}\right)$.

Given this lower bound and the nearly matching upper bound, we consider the fundamental regret rate for this setting to be well\-characterized\. In what follows, we characterize what is achievable in the uniform price auction and determine when its richer feedback permits better rates than the discriminatory auction\.

### 4.2 Uniform Price Auction

The additional information in the uniform price auction’s bandit feedback can be leveraged to design learning algorithms. We illustrate this via Algorithm [2](https://arxiv.org/html/2606.28943#alg2), which estimates the marginal CDFs $F_k$. During its exploration phase, the algorithm uses the feedback characterized in Lemma [1](https://arxiv.org/html/2606.28943#Thmtheorem1) to observe each $\beta_k$ in a round-robin fashion. Once the CDF estimates are sufficiently accurate, an estimated optimal bid $\mathbf{b}^{T_{expl}}$ is computed and played repeatedly during exploitation.

We define the empirical CDFs for the uniform auction under bandit feedback fork∈\[K\]k\\in\[K\]as:

∀x∈\[0,1\],F~k​\(x\):=∑j=1t𝟙​\{x∈\(bK\+2−kj,bK\+1−kj\]\}​𝟙​\{βkj≤x\}∑j=1t𝟙​\{x∈\(bK\+2−kj,bK\+1−kj\]\}\.\\forall x\\in\[0,1\],\\quad\\tilde\{F\}\_\{k\}\(x\):=\\frac\{\\sum\_\{j=1\}^\{t\}\\mathds\{1\}\\left\\\{x\\in\(b\_\{K\+2\-k\}^\{j\},b\_\{K\+1\-k\}^\{j\}\]\\right\\\}\\mathds\{1\}\\left\\\{\\beta\_\{k\}^\{j\}\\leq x\\right\\\}\}\{\\sum\_\{j=1\}^\{t\}\\mathds\{1\}\\left\\\{x\\in\(b\_\{K\+2\-k\}^\{j\},b\_\{K\+1\-k\}^\{j\}\]\\right\\\}\}\.Correspondingly, we define the empirical expected utilityu~\\tilde\{u\}for a bid𝐛∈B\\mathbf\{b\}\\in Basu~​\(𝐛\)=Uu​\(\(F~k\)k∈\[K\],𝐛\)\\tilde\{u\}\(\\mathbf\{b\}\)=U\_\{u\}\(\(\\tilde\{F\}\_\{k\}\)\_\{k\\in\[K\]\},\\mathbf\{b\}\)\.
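The empirical CDF above can be read as a ratio of two counts: among the rounds whose bid interval for index $k$ contains $x$, the fraction in which $\beta_{k}\leq x$. The following is a minimal Python sketch of that ratio, not the authors' implementation. The record layout `(lower, upper, beta_k)` and the function name are assumptions made for illustration, and the sketch takes the value of $\beta_{k}^{j}$ as given in every record; how the indicator is resolved when $\beta_{k}^{j}$ falls outside the interval is not covered here.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record for one round j and a fixed index k:
# (lower, upper, beta_k) with the half-open bid interval (lower, upper].
Round = Tuple[float, float, float]


def empirical_cdf(rounds: Sequence[Round], x: float) -> Optional[float]:
    """Ratio estimate of F_k(x) using only rounds whose interval contains x.

    Returns None when no round covers x (the denominator would be zero).
    """
    covered = 0  # rounds with x in (lower, upper]
    below = 0    # covered rounds that also satisfy beta_k <= x
    for lower, upper, beta_k in rounds:
        if lower < x <= upper:
            covered += 1
            if beta_k <= x:
                below += 1
    return below / covered if covered else None
```

Returning `None` for uncovered points makes explicit that the estimate is undefined wherever the learner's bids have never bracketed $x$.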

Algorithm 2: Estimate then Commit for Uniform Price Auction

Input: Time horizon $T$, exploration duration $T_{expl}$.

Output: Bids $(\mathbf{b}^{1},\ldots,\mathbf{b}^{T})\in B^{T}$.

1: Exploration Phase: for $t=1,2,\ldots,T_{expl}$ do

2: $k\leftarrow t-K\lfloor t/K\rfloor+1$.

3: Play $\mathbf{b}^{t}$ such that $\forall i\leq k,\ b_{i}^{t}=1$ and $\forall i>k,\ b_{i}^{t}=0$.

4: Receive utility $u^{t}=u(\mathbf{b}^{t},\boldsymbol{\beta}^{t})$ and feedback.

5: Exploitation Phase: for $t=T_{expl}+1,\ldots,T$ do

6: Play $\mathbf{b}^{t}=\operatorname*{arg\,max}_{\mathbf{b}\in B}\tilde{u}^{T_{expl}}(\mathbf{b})$.
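The exploration schedule in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the round-robin index and the resulting 0/1 bid vector; the function names are hypothetical, and the exploitation step (the arg max over $B$) is deliberately left out because it depends on the utility functional defined elsewhere in the paper.

```python
from typing import List


def exploration_index(t: int, K: int) -> int:
    """Round-robin index k = t - K * floor(t / K) + 1, which lies in {1, ..., K}."""
    return t - K * (t // K) + 1


def exploration_bid(t: int, K: int) -> List[float]:
    """Bid 1 on the first k units and 0 on the rest (units are 1-indexed)."""
    k = exploration_index(t, K)
    return [1.0 if i <= k else 0.0 for i in range(1, K + 1)]
```

With $K=3$, the index cycles through $2, 3, 1, 2, 3, 1, \ldots$, so each index is visited once every $K$ rounds.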

###### Theorem 3\.

The estimate-then-commit algorithm, with exploration time $T_{expl}=K^{2/3}T^{2/3}$, achieves a regret bound of $\tilde{\mathcal{O}}\left(K^{5/3}T^{2/3}\right)$.

###### Proof Sketch\.

The proof separates the regret incurred during exploration and exploitation. A simpler analysis using DKW inequalities yields a looser bound of $\tilde{O}(K^{2}T^{2/3})$ when setting $T_{expl}=KT^{2/3}$. ∎
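As a rough sanity check of the looser rate, the calculation below plugs $T_{expl}=KT^{2/3}$ into a generic explore-then-commit decomposition. It is a heuristic sketch rather than the paper's proof: it assumes that per-round regret is at most $K$ and that a uniform CDF error $\varepsilon$ translates into a utility error of order $K\varepsilon$. Both assumptions are introduced here for illustration, and constants and logarithmic factors are dropped.

$$n=\frac{T_{expl}}{K}=T^{2/3}\ \text{samples per index},\qquad\varepsilon=\tilde{O}\left(n^{-1/2}\right)=\tilde{O}\left(T^{-1/3}\right)\ \text{(DKW)},$$

$$\underbrace{K\cdot T_{expl}}_{\text{exploration}}+\underbrace{T\cdot K\varepsilon}_{\text{exploitation}}=K^{2}T^{2/3}+\tilde{O}\left(KT^{2/3}\right)=\tilde{O}\left(K^{2}T^{2/3}\right).$$

Under these assumptions the exploration term dominates, which is consistent with the stated $K^{2}$ dependence of the simpler analysis.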

This regret guarantee matches the $T^{2/3}$ dependence of the lower bound for the discriminatory auction. Since the $\Omega(T^{2/3})$ lower bound for the discriminatory auction is tight, it is natural to ask if the same holds for the uniform auction. Theorem [4](https://arxiv.org/html/2606.28943#Thmtheorem4) provides a matching lower bound.

###### Theorem 4\.

In the uniform price auction, any algorithm must incur a regret of $\Omega\left(T^{2/3}\right)$ for learning to bid with known valuations.

Thus, the worst-case minimax regret for both auction formats scales as $\tilde{\Theta}(T^{2/3})$, irrespective of the feedback richness. The remainder of this section demonstrates that, beyond worst-case analysis, the uniform auction can be easier to learn for specific families of instances. We then introduce our new A3M framework as an adaptive baseline and present comprehensive empirical comparisons.

### 4.3 The A3M Framework: An Adaptive Baseline

To overcome the limitations of fixed exploration-exploitation schedules and better leverage instance structure, we deploy the A3M (Adaptive, Adversarial & Multi-objective) learning framework described in Section [3](https://arxiv.org/html/2606.28943#S3). A3M maintains a parameterized bidding policy $\pi_{\theta}$ that maps a state (encoding history) to a bid vector $\mathbf{b}_{t}$. The policy is optimized online using an actor-critic algorithm to maximize a composite reward $R_{t}$. This reward combines the standard utility $u(\mathbf{b}_{t},\boldsymbol{\beta}_{t})$ with auxiliary objectives, such as revenue alignment and fairness, controlled by weights $\boldsymbol{\lambda}$. Crucially, A3M employs an explicit opponent model to estimate the adversary's bid distribution $P_{\phi}(\boldsymbol{\beta}|z_{t})$, enabling strategic reasoning akin to fictitious play. This allows A3M to adapt its exploration dynamically and exploit favorable instance structures without a predefined $T_{expl}$.
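To make the role of the weights $\boldsymbol{\lambda}$ concrete, the sketch below shows one plausible shape for a weighted composite reward. The exact definition of $R_{t}$ is given in Section 3 and is not reproduced here, so the linear form, the names `RewardWeights` and `composite_reward`, and the convention that the revenue and fairness terms are already signed so that larger is better are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical container for (lambda_u, lambda_r, lambda_f)."""
    utility: float = 1.0
    revenue: float = 0.0
    fairness: float = 0.0


def composite_reward(utility: float, revenue_term: float,
                     fairness_term: float, w: RewardWeights) -> float:
    """Assumed linear scalarization: lambda_u * u + lambda_r * r + lambda_f * f."""
    return (w.utility * utility
            + w.revenue * revenue_term
            + w.fairness * fairness_term)
```

Setting the revenue and fairness weights to zero recovers pure utility maximization, which is the configuration a utility-only baseline corresponds to.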

### 4.4 Empirical Performance Analysis

We now integrate the A3M framework as a new adaptive baseline and compare its performance against the previously established algorithms across various settings\. All results are averaged over 50 independent runs\.

#### Standard Regret Performance

Table [1](https://arxiv.org/html/2606.28943#S4.T1) presents the final cumulative regret for $K=5$, $T=5000$ under both auction formats with stochastic adversaries. A3M achieves the lowest regret in both settings. In the uniform auction, its adaptive exploration and opponent modeling allow it to outperform the estimate-then-commit baseline (Algorithm [2](https://arxiv.org/html/2606.28943#alg2)). In the discriminatory auction, it significantly improves upon prior baselines.

Table 1: Final Regret ($\times 10^{3}$) under Standard Stochastic Setting ($K=5,T=5000$). Lower is better.

#### Adaptation to Adversarial Environments

To test robustness, we evaluate performance against a non-stationary adversary that switches its bidding strategy every 1000 rounds Wang et al. ([2024a](https://arxiv.org/html/2606.28943#bib.bib49)). Table [2](https://arxiv.org/html/2606.28943#S4.T2) shows that A3M's explicit opponent modeling provides a substantial advantage, enabling faster adaptation to distribution shifts and resulting in lower cumulative regret compared to non-adaptive baselines.

Table 2: Final Regret ($\times 10^{3}$) against a Non-Stationary Adversary ($K=5,T=5000$). Lower is better.

![Refer to caption](https://arxiv.org/html/2606.28943v1/x1.png)

Figure 3: Instance-dependent performance on $\Delta$-separated distributions. Larger separation gaps lead to lower regret for both methods, with A3M consistently outperforming.

#### Instance\-Dependent Performance

We test on the $\Delta$-separated distributions described in Section [5](https://arxiv.org/html/2606.28943#S5). Figure [3](https://arxiv.org/html/2606.28943#S4.F3) and Table [3](https://arxiv.org/html/2606.28943#S4.T3) confirm the theoretical prediction: while both algorithms improve with larger $\Delta$, A3M consistently achieves lower regret by more efficiently focusing its exploration on the relevant, separated intervals through its adaptive policy.

Table 3: Final Regret ($\times 10^{3}$) on $\Delta$-Separated Distributions (Uniform Auction, $K=3,T=3000$).

![Refer to caption](https://arxiv.org/html/2606.28943v1/x2.png)

Figure 4: Scalability analysis with increasing number of units $K$. A3M scales more favorably than the baseline.

#### Scalability with Number of Units $K$

A key challenge in multi-unit auctions is scaling with $K$. Figure [4](https://arxiv.org/html/2606.28943#S4.F4) and Table [4](https://arxiv.org/html/2606.28943#S4.T4) show the regret for increasing $K$ with $T=10000$. The regret of the estimate-then-commit algorithm grows rapidly, consistent with its $\tilde{O}(K^{5/3}T^{2/3})$ dependence. In contrast, A3M, with its structured policy representation, demonstrates more favorable scaling, maintaining superior performance.

Table 4: Final Regret ($\times 10^{3}$) for Different $K$ (Uniform Auction, $T=10000$).

#### Multi\-Objective Trade\-offs

A distinguishing feature of A3M is its ability to optimize for trade-offs beyond pure utility. Table [5](https://arxiv.org/html/2606.28943#S4.T5) shows results for the uniform auction when A3M is configured with different reward weights $\boldsymbol{\lambda}$. Setting $\lambda_{r}>0$ allows the learner to slightly reduce auctioneer revenue loss while preserving most of its utility, demonstrating a tunable balance. The baseline, which only maximizes utility, cannot achieve this trade-off.

Table 5: Multi-Objective Performance of A3M (Uniform Auction, $K=5,T=5000$). “Utility” is the learner's average utility ($\times 10$); “Rev. Loss” is the auctioneer's average revenue loss relative to truthful bidding ($\times 10$).

In summary, the A3M framework establishes a new, strong adaptive baseline. It achieves the lowest regret among the compared methods by dynamically balancing exploration and exploitation, reasoning about opponent strategies, and scaling effectively with problem complexity. Its performance advantage stems from the integrated use of reinforcement learning and opponent modeling, which enables better adaptation to both stochastic and adversarial environments than algorithms with fixed exploration schedules.

## 5 Bandit Feedback: Beyond Worst-Case

### 5.1 Instance-Dependent Regret

To demonstrate that regret can scale as $\mathcal{O}(\sqrt{T})$ for instance families well-suited to the uniform price auction, we require an algorithm that adapts to such instances. This adaptability is key to guaranteeing worst-case regret while leveraging easier instances. Algorithm [2](https://arxiv.org/html/2606.28943#alg2) lacks this flexibility due to its fixed-length phases.

We propose improvements using sequentially shrinking estimation intervals and a successive elimination approach Even-Dar et al. ([2006](https://arxiv.org/html/2606.28943#bib.bib33)), which iteratively reduces a set of candidate bids $B^{t}\subset B$. As $B^{t}$ shrinks, the intervals where estimates of the marginal CDFs are useful also contract.

Algorithm 3: Bandit Feedback with Interval Refinement

Input: Time horizon $T$.

1: Initialize intervals $\mathcal{I}_{i}^{0}\leftarrow[0,1]$ for $i\in[K]$.

2: for $t=1,\ldots,T$ do

3: Choose bid $\mathbf{b}^{t}$ based on current intervals $\mathcal{I}_{k}^{t-1}$ and utility estimates.

4: Play $\mathbf{b}^{t}$, receive feedback.

5: Update intervals $(\mathcal{I}_{k}^{t})_{k}$ by eliminating bids with statistically low utility.

6: end for
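Step 5 relies on the successive elimination idea. The sketch below shows that generic technique on a finite set of arms with rewards in $[0,1]$; it is not Algorithm 3 itself, which operates on bid intervals and CDF estimates. The function name, the `pull` callback, and the particular Hoeffding-style confidence radius are assumptions chosen for illustration.

```python
import math
from typing import Callable, List


def successive_elimination(pull: Callable[[int], float], n_arms: int,
                           rounds: int, delta: float = 0.05) -> List[int]:
    """Generic successive elimination over arms with rewards assumed in [0, 1].

    Each round pulls every surviving arm once, then drops any arm whose upper
    confidence bound falls below the best lower confidence bound.
    """
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    for n in range(1, rounds + 1):
        for arm in active:
            sums[arm] += pull(arm)
        # Every surviving arm has exactly n samples at this point.
        radius = math.sqrt(math.log(4.0 * n_arms * n * n / delta) / (2.0 * n))
        best_mean = max(sums[arm] / n for arm in active)
        active = [arm for arm in active
                  if sums[arm] / n + radius >= best_mean - radius]
        if len(active) == 1:
            break
    return active
```

Because arms are dropped as soon as they are statistically dominated, the number of samples spent on a suboptimal arm shrinks as its gap to the best arm grows, which is the mechanism behind instance-dependent rates.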

Let $B^{\star}\subset B$ be the set of utility maximizers and define the interval gap $\Delta:=\min_{k\in\{2,\ldots,K\}}(\min\mathcal{I}_{k}^{\star}-\max\mathcal{I}_{k+1}^{\star})$, where $\mathcal{I}_{k}^{\star}$ are the smallest intervals containing $B^{\star}$.
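The gap $\Delta$ is the smallest distance between the lower end of one interval and the upper end of the next. A minimal Python sketch follows, assuming the intervals are supplied as `(low, high)` pairs in index order so that consecutive entries play the roles of $\mathcal{I}_{k}^{\star}$ and $\mathcal{I}_{k+1}^{\star}$. The function name is hypothetical, and the sketch simply uses whichever consecutive pairs are passed in rather than fixing the index range.

```python
from typing import Sequence, Tuple


def interval_gap(intervals: Sequence[Tuple[float, float]]) -> float:
    """Minimum over consecutive pairs of (min of I_k) - (max of I_{k+1}).

    A positive value means consecutive intervals are strictly separated.
    """
    if len(intervals) < 2:
        raise ValueError("need at least two intervals to define a gap")
    return min(intervals[i][0] - intervals[i + 1][1]
               for i in range(len(intervals) - 1))
```

A non-positive result signals overlapping or touching intervals, the case in which only the worst-case guarantee applies.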

###### Theorem 5\.

Algorithm [3](https://arxiv.org/html/2606.28943#alg3) guarantees a worst-case regret of $\tilde{\mathcal{O}}(K^{5/3}T^{2/3})$. When $\Delta>0$, it achieves an instance-dependent regret of $\tilde{\mathcal{O}}(K\sqrt{T})$.

### 5.2 Regret Separation

We illustrate the implications of Theorem[5](https://arxiv.org/html/2606.28943#Thmtheorem5)by describing instance families for which the achievable regret rates in uniform and discriminatory price auctions diverge\.

#### Unit Demand

When the bidder has unit demand ($v_{1}>0,v_{i}=0$ for $i>1$), the uniform auction is truthful, leading to zero regret, while the discriminatory auction still suffers $\Omega(T^{2/3})$ regret.

#### Two-Unit Demand

In the two-unit demand setting ($v_{1}>0,v_{2}>0,v_{i}=0$ for $i>2$), the discriminatory auction suffers $\Omega(T^{2/3})$ regret, whereas an adaptive algorithm for the uniform auction can achieve $\mathcal{O}(\sqrt{T})$ regret when $\Delta>0$.

#### $\Delta$-Separated Distributions

Let $\mathcal{D}$ be a distribution such that each coordinate $\beta_{k}$ almost surely lies in disjoint intervals $\mathcal{I}_{k}$, separated by a gap $\Delta$.

###### Lemma 6.

Learning in a discriminatory auction when the adversary's distribution is $\Delta$-separated requires $\Omega(T^{2/3})$ regret. In the uniform auction, an adaptive algorithm can achieve $\mathcal{O}(\sqrt{T})$ regret.

### 5.3 I.I.D. Adversaries

This section characterizes achievable regret when the bidder faces $N$ symmetric, unit-demand participants, with opposing bids being the $K$ highest order statistics of i.i.d. samples from a distribution $\mathcal{P}$.

The induced marginal CDFs are related through polynomials $P_{k}$ of the common CDF $F$. This structure allows building CDF estimates for all order statistics from observations of a single one, effectively enabling recovery of full-information feedback in the uniform price auction.
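The polynomial relation can be illustrated with the standard order-statistic identity: the $k$-th highest of $N$ i.i.d. draws is at most $x$ exactly when at most $k-1$ draws exceed $x$, so its CDF is $\sum_{j=0}^{k-1}\binom{N}{j}(1-F)^{j}F^{N-j}$. The Python sketch below evaluates this polynomial and inverts it by bisection to convert an estimate for one index into an estimate for another. Whether this matches the paper's $P_{k}$ indexing exactly is an assumption, and the function names are hypothetical.

```python
from math import comb


def kth_highest_cdf(F: float, N: int, k: int) -> float:
    """CDF value of the k-th highest of N i.i.d. draws, given base CDF value F."""
    return sum(comb(N, j) * (1.0 - F) ** j * F ** (N - j) for j in range(k))


def invert_kth_highest_cdf(Fk: float, N: int, k: int, tol: float = 1e-12) -> float:
    """Recover F from the k-th highest CDF value by bisection (monotone in F)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kth_highest_cdf(mid, N, k) < Fk:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def convert_estimate(F_src: float, N: int, k_src: int, k_dst: int) -> float:
    """Map a CDF value for index k_src to the implied CDF value for index k_dst."""
    return kth_highest_cdf(invert_kth_highest_cdf(F_src, N, k_src), N, k_dst)
```

For example, with $N=2$ the maximum has CDF $F^{2}$ and the second highest has CDF $2F-F^{2}$, so an observed value of $0.25$ for the maximum implies $F=0.5$ and a value of $0.75$ for the second highest.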

Algorithm 4: UBIID Algorithm for I.I.D. Setting

Input: Time horizon $T$, number of adversaries $N$.

1: for $t=1,\ldots,T$ do

2: For each $x$, select $k^{\star}(x)=\operatorname*{arg\,max}_{k}t_{k}(x)$ (most observed index).

3: Estimate CDFs: $\bar{F}_{k}^{t}(x):=\tilde{F}_{k^{\star}(x)\rightarrow k}^{t}(x)$.

4: Play $\mathbf{b}^{t}:=\operatorname*{arg\,max}_{\mathbf{b}\in B}\bar{u}^{t}(\mathbf{b})$.

5: end for

###### Theorem 7\.

When facing i.i.d. adversaries in the uniform auction with bandit feedback, Algorithm [4](https://arxiv.org/html/2606.28943#alg4) guarantees $\tilde{\mathcal{O}}(\sqrt{T})$ regret.

###### Lemma 8.

When facing i.i.d. adversaries in the uniform auction with bandit feedback, any learning algorithm must incur at least $\Omega(\sqrt{T})$ regret.

For the discriminatory auction, the lower bound from Lemma [2](https://arxiv.org/html/2606.28943#Thmtheorem2) still applies, yielding $\Omega(T^{2/3})$ regret even in the i.i.d. setting. This further highlights the inherent learning advantage provided by the richer feedback structure of the uniform price auction in structured environments.

## 6 Ablation Study

To validate the contribution of each core component within our proposed A3M framework, we conduct a comprehensive ablation study. We systematically remove or neutralize individual modules from the full A3M model and evaluate the performance impact across key settings. The complete A3M model integrates three key innovations: (1) Adaptive Learning via the Actor-Critic RL backbone (AL), (2) Adversarial Reasoning via the explicit opponent model (AR), and (3) Multi-Objective Reward design (MO).

We compare the following variants:

- Full A3M (Ours) is the complete model with all three modules active ($\lambda_{u}=1,\lambda_{r}=0.3,\lambda_{f}=0.1$).
- A3M w/o AR removes the adversarial reasoning module: the opponent model $g_{\phi}$ is fixed, and the policy is optimized against a static historical average instead of a dynamically updated best-response model.
- A3M w/o MO removes the multi-objective reward design, setting $\lambda_{u}=1,\lambda_{r}=0,\lambda_{f}=0$, so the agent optimizes for pure utility only.
- A3M w/o AL replaces the adaptive Actor-Critic RL core with a fixed exploration-then-exploitation schedule, mimicking the structure of Algorithm [2](https://arxiv.org/html/2606.28943#alg2) but using neural networks for function approximation. It lacks continuous value function guidance and policy gradient updates.

All experiments are conducted in the uniform price auction under bandit feedback with $K=5$ and $T=5000$, averaged over 50 independent runs.

![Refer to caption](https://arxiv.org/html/2606.28943v1/x3.png)

Figure 5: Ablation study results. Each component contributes to A3M's overall performance.

#### Overall Performance under Standard Setting

Figure [5](https://arxiv.org/html/2606.28943#S6.F5) and Table [6](https://arxiv.org/html/2606.28943#S6.T6) present the final regret and average utility for all model variants under the standard stochastic adversary setting. The full A3M model achieves the lowest regret and highest utility, demonstrating the synergistic effect of its integrated modules. Removing the adversarial reasoning module (w/o AR) leads to a noticeable performance drop, as the policy fails to accurately track and respond to the opponent's distribution. Disabling the multi-objective reward (w/o MO) results in a moderate increase in regret, indicating that the pure utility objective is sufficient but suboptimal for overall mechanism performance. Crucially, replacing the adaptive RL core with a fixed schedule (w/o AL) causes the most significant degradation, highlighting that the dynamic, value-guided exploration-exploitation balance provided by the Actor-Critic architecture is fundamental to A3M's effectiveness.

Table 6: Ablation Study: Performance under Standard Stochastic Setting (Uniform Auction, $K=5$, $T=5000$).

#### Robustness against Non\-Stationary Adversaries

The importance of the adversarial reasoning module is further emphasized in a non-stationary environment. Table [7](https://arxiv.org/html/2606.28943#S6.T7) shows results against an adversary that switches its bidding strategy periodically. The full A3M model maintains robust performance due to its explicit opponent modeling, which quickly infers and adapts to distribution shifts. In stark contrast, the A3M w/o AR variant suffers a dramatic increase in regret, as its static model leads to persistent misalignment with the current adversary strategy. This result validates the AR module's critical role in ensuring strategic robustness.

Table 7: Ablation Study: Performance against a Non-Stationary Adversary (Uniform Auction, $K=5$, $T=5000$).

#### Multi\-Objective Trade\-off Capability

A key advantage of A3M is its ability to navigate trade-offs between learner utility and auctioneer revenue. Table [8](https://arxiv.org/html/2606.28943#S6.T8) evaluates this capability. The full A3M model, configured with $\lambda_{r}=0.3$, successfully reduces auctioneer revenue loss by approximately 14% compared to the pure-utility variant (w/o MO), while sacrificing only a minimal amount of the learner's own utility. The A3M w/o MO variant, by design, cannot perform this trade-off and achieves higher revenue loss. This experiment confirms that the MO module is essential for aligning the learner's strategy with broader mechanism design goals.

Table 8: Ablation Study: Multi-Objective Trade-offs (Uniform Auction, $K=5$, $T=5000$).

#### Adaptation to Instance Structure ($\Delta$-Separated)

Finally, we test the variants' ability to exploit favorable instance structures, specifically $\Delta$-separated distributions where the optimal bids lie in disjoint intervals. Table [9](https://arxiv.org/html/2606.28943#S6.T9) shows the regret for different separation gaps $\Delta$. The full A3M model excels across all $\Delta$ values, particularly when $\Delta>0$, demonstrating its adaptive learning module's (AL) effectiveness in focusing exploration on the promising regions identified through the opponent model. The A3M w/o AL variant, with its fixed exploration schedule, cannot leverage the instance structure efficiently and shows markedly higher regret, especially in the easier cases ($\Delta=0.2$). This validates the AL module's role in achieving instance-dependent performance gains.

Table 9: Ablation Study: Performance on $\Delta$-Separated Distributions (Uniform Auction, $K=3$, $T=3000$).

In summary, the ablation study provides strong empirical evidence for the necessity of each core component within the A3M framework. The Adaptive Learning (AL) module is fundamental for dynamic strategy optimization and enables instance-adaptive performance. The Adversarial Reasoning (AR) module is crucial for robustness against strategic and non-stationary opponents. The Multi-Objective Reward (MO) module allows for flexible trade-offs beyond pure utility maximization. Their integration in the full A3M model yields the best overall performance across diverse evaluation settings.

## 7 Additional Evaluation

### 7.1 Training Dynamics and Convergence Analysis

![Refer to caption](https://arxiv.org/html/2606.28943v1/x4.png)

Figure 6: Regret convergence comparison across algorithms. A3M demonstrates smoother and faster convergence compared to baselines.

To understand the learning dynamics of the proposed A3M framework and compare it with baseline algorithms, we analyze the evolution of cumulative regret over time. Figure [6](https://arxiv.org/html/2606.28943#S7.F6) illustrates the regret trajectories. We plot $\text{Regret}(t)=\sum_{\tau=1}^{t}\left(u(\mathbf{b}^{*},\boldsymbol{\beta}^{\tau})-u(\mathbf{b}^{\tau},\boldsymbol{\beta}^{\tau})\right)$ for $t\in[T]$ across different experimental settings, including the standard stochastic environment, the non-stationary adversarial environment, and the $\Delta$-separated distribution scenario (Section [5](https://arxiv.org/html/2606.28943#S5)). As suggested by the main empirical results, the estimate-then-commit algorithm (Algorithm [2](https://arxiv.org/html/2606.28943#alg2)) exhibits a distinct two-phase pattern: a phase of approximately linear regret growth during its fixed exploration period, followed by a slower growth rate during exploitation. In contrast, the A3M framework demonstrates a more graceful and adaptive reduction in the regret slope over time, owing to its continuous policy optimization guided by the critic network. Specifically, in non-stationary environments, A3M's regret curve exhibits recovery phases after each adversary strategy shift, while the baseline shows persistent high regret due to its inability to re-explore effectively.
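The plotted quantity is a running sum of per-round utility differences. A minimal Python sketch of that bookkeeping is shown below, assuming the per-round utilities of the benchmark bid $\mathbf{b}^{*}$ and of the played bids have already been computed and stored in two equal-length sequences; the function and argument names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_regret(benchmark_utils: Sequence[float],
                      realized_utils: Sequence[float]) -> List[float]:
    """Running sum of u(b*, beta^tau) - u(b^tau, beta^tau) for tau = 1..t."""
    if len(benchmark_utils) != len(realized_utils):
        raise ValueError("both sequences must cover the same rounds")
    return list(accumulate(b - r for b, r in zip(benchmark_utils, realized_utils)))
```

A flat stretch in the returned list corresponds to rounds where the played bid matched the benchmark's utility, while a linear stretch corresponds to a constant per-round loss, as in a fixed exploration phase.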

### 7.2 Case Study: Visualization of Multi-Objective Trade-offs

![Refer to caption](https://arxiv.org/html/2606.28943v1/x5.png)

Figure 7: Multi-objective trade-off analysis. Different $\lambda$ configurations yield different balances between learner utility and auctioneer revenue loss.

A key advantage of the A3M framework is its tunable reward function $R_{t}(\boldsymbol{\lambda})$, which enables the study of trade-offs between learner utility, auctioneer revenue, and fairness. Figure [7](https://arxiv.org/html/2606.28943#S7.F7) visualizes these trade-offs. To qualitatively illustrate how different reward weights $\boldsymbol{\lambda}$ shape the learned bidding strategy, we conduct a case study focusing on the final learned policy. We fix a specific stochastic adversary distribution and run A3M to convergence under three configurations: (1) Pure utility maximization ($\lambda_{u}=1,\lambda_{r}=0,\lambda_{f}=0$), (2) Utility-revenue balance ($\lambda_{u}=1,\lambda_{r}=0.3,\lambda_{f}=0$), and (3) Utility-fairness balance ($\lambda_{u}=1,\lambda_{r}=0,\lambda_{f}=0.2$). For each configuration, we visualize the final learned bidding function $\phi(k)$ for $k\in[K]$. Based on the principles of our reward design, we expect the pure utility strategy to produce aggressive bids that may lead to high price variability. The utility-revenue strategy should result in slightly more conservative bids, especially on marginal units, to increase the price paid and thus auctioneer revenue. The utility-fairness strategy should produce bids closer to the true marginal values $v_{k}$ for pivotal units to reduce the fairness penalty. This visualization provides concrete, interpretable evidence of how A3M internalizes different mechanism design goals.

### 7.3 Extension: Performance under I.I.D. Adversaries

![Refer to caption](https://arxiv.org/html/2606.28943v1/x6.png)

Figure 8: Performance comparison under i.i.d. adversaries. A3M achieves $\tilde{O}(\sqrt{T})$ regret comparable to the specialized UBIID algorithm, while Est.-Then-Commit exhibits $\tilde{O}(T^{2/3})$ scaling.

The theoretical analysis in Section [5](https://arxiv.org/html/2606.28943#S5) establishes that for i.i.d. adversaries in the uniform price auction, specialized algorithms like UBIID (Algorithm [4](https://arxiv.org/html/2606.28943#alg4)) can achieve $\tilde{\mathcal{O}}(\sqrt{T})$ regret, a significant improvement over the worst-case $T^{2/3}$ rate. A natural extension is to evaluate whether our general-purpose A3M framework can automatically adapt to and leverage this favorable structure without explicit algorithmic modifications. Figure [8](https://arxiv.org/html/2606.28943#S7.F8) presents the results of this experiment. We design an experiment where the adversary's bids $\boldsymbol{\beta}^{t}$ are generated as the top $K$ order statistics of $N$ i.i.d. samples from a distribution $\mathcal{P}$ (e.g., $\text{Beta}(2,5)$). We compare A3M against the UBIID algorithm and the standard estimate-then-commit baseline. As shown in Figure [8](https://arxiv.org/html/2606.28943#S7.F8), A3M, through its opponent model $g_{\phi}$, learns to infer the underlying distribution $\mathcal{P}$ and the relationships between order statistics, achieving regret scaling competitive with UBIID. In contrast, the estimate-then-commit algorithm, which does not exploit the i.i.d. structure, exhibits its standard $\tilde{O}(T^{2/3})$ regret growth. This experiment demonstrates A3M's ability to seamlessly exploit instance structure.
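The adversary generation described above can be sketched with the Python standard library: draw $N$ i.i.d. samples, sort them, and keep the $K$ largest. This is an illustrative reconstruction, not the authors' experiment code; the function name is hypothetical, and the default shape parameters follow the $\text{Beta}(2,5)$ example from the text.

```python
import random
from typing import List


def sample_adversary_bids(N: int, K: int, rng: random.Random,
                          alpha: float = 2.0, beta: float = 5.0) -> List[float]:
    """Return the K highest of N i.i.d. Beta(alpha, beta) draws, in descending order."""
    if K > N:
        raise ValueError("cannot take more order statistics than samples")
    samples = [rng.betavariate(alpha, beta) for _ in range(N)]
    return sorted(samples, reverse=True)[:K]
```

Passing an explicit `random.Random` instance keeps independent runs reproducible, for example `sample_adversary_bids(10, 3, random.Random(0))`.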

### 7.4 Robustness to Non-Stationary Adversaries

![Refer to caption](https://arxiv.org/html/2606.28943v1/x7.png)

Figure 9: Robustness comparison under non-stationary adversaries with strategy shifts at $t=1000,2000,3000,4000$. A3M quickly recovers after each shift, while Est.-Then-Commit accumulates persistent regret.

To further validate A3M's adversarial reasoning capabilities, we evaluate performance against a non-stationary adversary that periodically changes its bidding strategy. Figure [9](https://arxiv.org/html/2606.28943#S7.F9) shows the regret trajectories when the adversary shifts its distribution at $t=1000,2000,3000,4000$. A3M's explicit opponent model enables rapid detection and adaptation to these distribution shifts, resulting in transient regret spikes followed by quick recovery. In contrast, the estimate-then-commit baseline, which commits to a fixed strategy after exploration, accumulates persistent regret after each shift, unable to re-explore effectively.

### 7.5 Parameter Sensitivity Analysis

![Refer to caption](https://arxiv.org/html/2606.28943v1/x8.png)

Figure 10: Parameter sensitivity analysis: (a) learning rate, (b) discount factor $\gamma$, and (c) replay buffer size. Red dashed lines indicate the default configuration. A3M is robust across a wide range of hyperparameter settings.

We conduct a sensitivity analysis to understand how A3M's performance varies with key hyperparameters. Figure [10](https://arxiv.org/html/2606.28943#S7.F10) shows the final regret as a function of (a) learning rate, (b) discount factor $\gamma$, and (c) replay buffer size. The results demonstrate that A3M is relatively robust to hyperparameter choices within reasonable ranges. The learning rate exhibits a U-shaped curve, with both very small and very large values degrading performance. The discount factor $\gamma=0.99$ provides a good balance between short-term and long-term rewards. Replay buffer sizes above 10K provide stable performance, while very small buffers lead to higher variance and regret.

### 7.6 Performance Across Auction Types

![Refer to caption](https://arxiv.org/html/2606.28943v1/x9.png)

Figure 11: Performance comparison across uniform price and discriminatory price auction formats. A3M consistently outperforms baselines in both settings.

Figure [11](https://arxiv.org/html/2606.28943#S7.F11) compares the performance of all algorithms across both auction formats. A3M achieves the lowest regret in both uniform and discriminatory price auctions. Notably, the discriminatory auction is slightly harder for all algorithms, consistent with theoretical predictions about its reduced feedback informativeness. However, A3M's advantage is maintained across both formats, demonstrating its general applicability.

### 7.7 Regret Scaling with Time Horizon

![Refer to caption](https://arxiv.org/html/2606.28943v1/x10.png)

Figure 12: Final regret vs. time horizon $T$ on log-log scale. Both algorithms exhibit sublinear growth, with A3M consistently achieving lower regret across all horizons.

Finally, we analyze how regret scales with the time horizon $T$. Figure [12](https://arxiv.org/html/2606.28943#S7.F12) plots the final regret against $T$ on a log-log scale. Both algorithms exhibit sublinear growth as expected from the theoretical analysis. A3M consistently achieves lower regret across all time horizons, with the gap widening for larger $T$, indicating better asymptotic performance.
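On a log-log plot, a regret curve of the form $cT^{\alpha}$ appears as a straight line with slope $\alpha$, so the empirical exponent can be read off with an ordinary least-squares fit on the logarithms. The Python sketch below shows that fit; the function name is hypothetical, and the synthetic inputs in the usage note are for illustration only, not the paper's measurements.

```python
import math
from typing import Sequence


def loglog_slope(horizons: Sequence[float], regrets: Sequence[float]) -> float:
    """Least-squares slope of log(regret) against log(T)."""
    if len(horizons) != len(regrets) or len(horizons) < 2:
        raise ValueError("need at least two matched (T, regret) pairs")
    xs = [math.log(T) for T in horizons]
    ys = [math.log(r) for r in regrets]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For synthetic data generated as `3 * T ** (2 / 3)` the fitted slope is close to $2/3$, and for `math.sqrt(T)` it is close to $1/2$, which are the two exponents discussed in the theoretical sections.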

## 8 Conclusion

This paper addresses the problem of learning to bid in repeated multi-unit auctions under bandit feedback. We propose the A3M (Adaptive, Adversarial & Multi-Objective) framework, which departs from traditional estimate-then-commit approaches. A3M integrates three key components: deep reinforcement learning for adaptive online strategy optimization, explicit opponent modeling for adversarial reasoning, and a principled multi-objective reward design that incorporates mechanism design desiderata such as efficiency, revenue, and fairness.

Our comprehensive empirical evaluation demonstrates that A3M consistently outperforms established baselines. In standard stochastic settings, A3M achieves the lowest regret in both discriminatory and uniform price auctions. Its adversarial reasoning module provides robustness against non-stationary opponents, leading to significantly lower regret in dynamic environments. Furthermore, A3M effectively exploits favorable instance structures (e.g., $\Delta$-separated distributions) and scales more gracefully with the number of units $K$ than algorithms with fixed exploration schedules. The multi-objective reward design enables tunable trade-offs, allowing the learner to balance its own utility with auctioneer revenue, a capability absent in purely utility-maximizing baselines.

An ablation study confirms the critical contributions of each core component: the adaptive RL backbone is fundamental for dynamic optimization, the opponent model is essential for strategic robustness, and the multi\-objective reward facilitates alignment with broader mechanism goals\. Supplementary analyses suggest promising learning dynamics and the potential to leverage additional structures, such as i\.i\.d\. adversaries\.

Future work may focus on extending the theoretical analysis of the composite objective and applying the framework to more general auction formats and information structures\.
