Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
Proposes a truthful online preference aggregation mechanism for LLM fine-tuning in mobile crowdsourcing, addressing strategic worker misreporting and achieving sublinear regret.
arXiv:2605.24052v1 Announce Type: new Abstract: To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)-generated content (e.g., AI-generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM-based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots. In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi-agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker's weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$. Experiments on LLM fine-tuning with real-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.
Cached at: 05/26/26, 08:59 AM
# Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing
Source: [https://arxiv.org/html/2605.24052](https://arxiv.org/html/2605.24052)
Shugang Hao and Lingjie Duan

Part of this work has appeared in IEEE ICASSP 2025 \[[1](https://arxiv.org/html/2605.24052#bib.bib1)\]\. The research of Lingjie Duan was supported by Guangdong Provincial Key Lab of Integrated Communication, Sensing and Computation for Ubiquitous Internet of Things \(No\. 2023B1212010007\)\. Shugang Hao is with the Singapore Wireless Innovation Centre, Singapore University of Technology and Design, Singapore, 487372 Singapore\. Lingjie Duan is with the Internet of Things Thrust and the Artificial Intelligence Thrust, Hong Kong University of Science and Technology, Guangzhou 511455, China\. E\-mail: shugang\_hao@sutd\.edu\.sg, lingjieduan@hkust\-gz\.edu\.cn\. \(Corresponding author: Lingjie Duan\.\)

###### Abstract

To better serve users’ demands in mobile applications \(e\.g\., navigation\), mobile crowdsourcing platforms can iteratively align large language model \(LLM\)\-generated content \(e\.g\., AI\-generated traffic condition predictions\) with human feedback collected from crowdsourcing workers \(e\.g\., mobile users\)\. However, workers may strategically misreport their online preference feedback to maximize their influence or payment\. Existing pipelines in mobile crowdsourcing \(e\.g\., EM\-based weight estimation\) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots\. In this paper, we study truthful online preference aggregation for LLM fine\-tuning in mobile crowdsourcing\. We formulate a new dynamic Bayesian game to model the multi\-agent online learning process between the platform and strategic mobile workers\. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker’s weight in the preference aggregation according to their feedback accuracy\. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots\. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$\. Experiments on LLM fine\-tuning with real\-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes\.

## I Introduction

To better serve users’ demands in mobile applications, mobile crowdsourcing platforms can iteratively align large language model \(LLM\)\-generated content with human feedback collected from crowdsourcing workers \(e\.g\.,\[[2](https://arxiv.org/html/2605.24052#bib.bib2)\],\[[3](https://arxiv.org/html/2605.24052#bib.bib3)\]\)\. For example, navigation platforms \(e\.g\., Waze\) continuously collect human feedback on traffic conditions, routes, and system recommendations, providing human feedback required for iterative LLM alignment in dynamic mobile environments \(e\.g\.,\[[4](https://arxiv.org/html/2605.24052#bib.bib4),[5](https://arxiv.org/html/2605.24052#bib.bib5)\]\)\. Mobile conversational AI applications \(e\.g\., ChatGPT and Gemini mobile\) continuously collect users’ interaction feedback \(e\.g\., binary ratings, response regenerations, and follow\-up corrections\) to assess and improve system\-generated responses \(e\.g\.,\[[6](https://arxiv.org/html/2605.24052#bib.bib6)\],\[[7](https://arxiv.org/html/2605.24052#bib.bib7)\]\)\.

However, recent studies find that selfish workers may strategically misreport their online preference feedback to maximize their influence or payment \(e\.g\., \[[8](https://arxiv.org/html/2605.24052#bib.bib8),[9](https://arxiv.org/html/2605.24052#bib.bib9),[10](https://arxiv.org/html/2605.24052#bib.bib10),[11](https://arxiv.org/html/2605.24052#bib.bib11)\]\)\. For example, there is a renowned “wet bias” where a weather forecaster, acting as a worker or predictor, may deliberately report an exaggerated probability of precipitation to increase the influence of his forecast in the weather forecasting platform’s final prediction \(e\.g\., \[[12](https://arxiv.org/html/2605.24052#bib.bib12)\]\)\. Besides, a substantial number of Amazon Mechanical Turk \(MTurk\) workers are found to strategically misreport their responses to platform\-elicited screening questions \(e\.g\., falsely claiming required demographics, prior experience, or device ownership\) that are explicitly requested by the platform, aiming to increase their access to, or weight in, higher\-paying tasks \(e\.g\., \[[13](https://arxiv.org/html/2605.24052#bib.bib13),[14](https://arxiv.org/html/2605.24052#bib.bib14)\]\)\. Nevertheless, existing adaptive aggregation pipelines \(e\.g\., EM\-based weighting \[[15](https://arxiv.org/html/2605.24052#bib.bib15),[16](https://arxiv.org/html/2605.24052#bib.bib16)\] and Hedge\-style online learning \[[17](https://arxiv.org/html/2605.24052#bib.bib17),[18](https://arxiv.org/html/2605.24052#bib.bib18)\]\) largely assume passive or truthful reporting and ignore the possibility of strategic misreporting by workers\. Our first research question arises:

- Q1\. How vulnerable is the current practice of LLM fine\-tuning against selfish workers?

Later, we prove that such current practice fails to identify the most accurate worker in the online learning process\. Recent work \(e\.g\.,\[[8](https://arxiv.org/html/2605.24052#bib.bib8),[10](https://arxiv.org/html/2605.24052#bib.bib10),[9](https://arxiv.org/html/2605.24052#bib.bib9),[19](https://arxiv.org/html/2605.24052#bib.bib19),[20](https://arxiv.org/html/2605.24052#bib.bib20)\]\) proposes monetary mechanism design to elicit truthful preferences from strategic workers in LLM fine\-tuning\. However, such payment\-based mechanisms largely focus on one\-shot or offline preference elicitation and do not consider online interactions, where workers have more room to strategically misreport and play with the platform for long\-term influence\.

We also find some recent studies on online or iterative LLM alignment \(e\.g\.,\[[21](https://arxiv.org/html/2605.24052#bib.bib21)\],\[[22](https://arxiv.org/html/2605.24052#bib.bib22)\],\[[23](https://arxiv.org/html/2605.24052#bib.bib23)\]\), where the system performs LLM fine\-tuning using periodically\-collected human annotations\. Yet, these studies focus on preference feedback from a single worker and do not address diverse human feedback from multiple workers\. Further, they assume that a worker is always truthful to provide his real preference feedback, which does not capture the strategic misreport from multiple workers\.

In the related literature of algorithmic game theory, there are relevant non\-monetary mechanism studies on facility location games \(e\.g\., \[[24](https://arxiv.org/html/2605.24052#bib.bib24)\], \[[25](https://arxiv.org/html/2605.24052#bib.bib25)\], \[[26](https://arxiv.org/html/2605.24052#bib.bib26)\]\), where the system aims to incentivize customers’ truthful reporting of their locations to optimize facility placement\. Each customer can strategically misreport his location to pull the facility placement as close to his location \(preference\) as possible\. The popular “median” scheme \(e\.g\., \[[11](https://arxiv.org/html/2605.24052#bib.bib11)\], \[[27](https://arxiv.org/html/2605.24052#bib.bib27)\]\) for aggregating multi\-agent reports is widely used to elicit customers’ truthful reports\. Yet, later we prove that it can incur a non\-vanishing regret over time\. Thus, our second research question arises:

- Q2\. How to design a truthful and regret\-efficient mechanism against selfish workers in LLM fine\-tuning for mobile crowdsourcing?

Note that motivating truthful feedback from workers while achieving vanishing regret is highly challenging\. First, workers’ true preferences are hidden and may vary across time, making it difficult for the platform to detect or correct misreports and reliably infer these preferences \(e\.g\.,\[[8](https://arxiv.org/html/2605.24052#bib.bib8)\]\)\. Furthermore, because the most accurate worker is unknown and must be learned online, the platform finds it hard to dynamically assign weights in a way that guarantees vanishing regret in the presence of strategic behavior\.

We summarize our key novelty and main results as follows\.

- Truthful online preference aggregation for LLM fine\-tuning in mobile crowdsourcing: In this work, we study the design of a truthful online preference aggregation mechanism in mobile crowdsourcing applications, where heterogeneous workers may strategically misreport their preference feedback to maximize their long\-term influence or payment\. The aggregated preferences serve as the human\-feedback dataset for iteratively fine\-tuning a downstream LLM at the platform\. Unlike the LLM literature studying either a single worker or offline preference feedback \(e\.g\., \[[8](https://arxiv.org/html/2605.24052#bib.bib8)\], \[[10](https://arxiv.org/html/2605.24052#bib.bib10)\], \[[9](https://arxiv.org/html/2605.24052#bib.bib9)\], \[[19](https://arxiv.org/html/2605.24052#bib.bib19)\], \[[21](https://arxiv.org/html/2605.24052#bib.bib21)\], \[[22](https://arxiv.org/html/2605.24052#bib.bib22)\], \[[23](https://arxiv.org/html/2605.24052#bib.bib23)\]\), we focus on how a platform can incentivize selfish workers’ truthful feedback through online aggregation mechanism design\.
- Non\-vanishing regrets of current practice: We prove that the current crowdsourcing practice \(e\.g\., EM\-based weight estimation\) fails to identify the most accurate worker and can lead to a non\-vanishing regret $\mathcal{O}(T)$ over $T$ time slots\. Further, we prove that the popular median scheme in the algorithmic game theory literature still incurs a linear regret $\mathcal{O}(T)$\.
- Our novel truthful online weighted aggregation mechanism: We first formulate a new dynamic Bayesian game to model the multi\-agent online learning process between the platform and strategic workers\. We then propose a novel online weighted aggregation mechanism to dynamically adjust workers’ weights in the preference aggregation according to their feedback accuracy during the online learning process\. We prove that our mechanism guarantees workers’ truthful preference feedback and achieves a vanishing regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots\. We further prove that our mechanism is responsive to new high\-quality workers under the uniform step\-size $\alpha$, and remains robust under bounded noisy verification of the ground\-truth system state\.
- Extension to limited worker feedback: In practice, collecting feedback from multiple workers can be difficult due to cost and coordination challenges, which can in turn slow down the online learning process in LLM fine\-tuning \(e\.g\., \[[28](https://arxiv.org/html/2605.24052#bib.bib28)\]\)\. We further extend to address a challenging scenario where only one worker’s preference feedback is available per time slot\. We propose a novel online mixed selection mechanism to ensure truthful feedback from any strategic worker while maintaining a sublinear regret $\mathcal{O}(\sqrt{T})$\. Experiments on LLM fine\-tuning based on real\-world datasets further demonstrate significant performance gains of our proposed mechanisms compared to benchmark schemes\.

The rest of this paper is organized as follows\. Section [II](https://arxiv.org/html/2605.24052#S2) reviews related work\. Section [III](https://arxiv.org/html/2605.24052#S3) introduces the system model and the dynamic Bayesian game formulation for online mobile crowdsourcing based on LLM fine\-tuning iterations\. Section [IV](https://arxiv.org/html/2605.24052#S4) analyzes three common schemes from the literature, which serve as benchmarks against which we later compare our mechanism\. Section [V](https://arxiv.org/html/2605.24052#S5) details our proposed mechanism design and analysis\. Section [VI](https://arxiv.org/html/2605.24052#S6) extends the framework to limited worker feedback\. Section [VII](https://arxiv.org/html/2605.24052#S7) presents experimental results on real\-world datasets\. Section [VIII](https://arxiv.org/html/2605.24052#S8) concludes\.

## II Related Work

In this section, we discuss four lines of existing work most relevant to our study\.

Online LLM alignment with human feedback\. Recent studies on online or iterative LLM alignment perform LLM fine\-tuning using periodically\-collected human annotations to keep the policy aligned with evolving user preferences\. Xiong et al\. \[[23](https://arxiv.org/html/2605.24052#bib.bib23)\] formulate iterative preference learning as a KL\-regularized optimization against a reference model and provide theoretical guarantees under the assumption of truthful single\-worker feedback\. Ye et al\. \[[22](https://arxiv.org/html/2605.24052#bib.bib22)\] extend this framework with general preference models and derive convergence rates under online interactions\. Dong et al\. \[[21](https://arxiv.org/html/2605.24052#bib.bib21)\] propose an online RLHF workflow that integrates reward modeling with iterative policy updates from streaming human feedback\. More recent work studies online direct alignment without an explicit reward model, including iterative DPO variants and online preference optimization under distribution shift \(e\.g\., \[[22](https://arxiv.org/html/2605.24052#bib.bib22),[23](https://arxiv.org/html/2605.24052#bib.bib23)\]\)\. However, these studies focus on preference feedback from a single worker and assume that the worker is always truthful, which does not capture the strategic misreport that arises in mobile crowdsourcing where heterogeneous selfish workers compete for long\-term influence or payment\.

Monetary mechanism design for truthful preference elicitation\. Another line of recent work proposes monetary mechanism design to elicit truthful preferences from strategic workers in LLM fine\-tuning\. Sun et al\. \[[8](https://arxiv.org/html/2605.24052#bib.bib8)\] design payment mechanisms for fine\-tuning with multiple reward models, ensuring incentive compatibility through monetary transfers\. Soumalias et al\. \[[9](https://arxiv.org/html/2605.24052#bib.bib9)\] propose truthful aggregation mechanisms for LLMs in online advertising, where workers’ valuations are elicited through auction\-style payments\. Park et al\. \[[10](https://arxiv.org/html/2605.24052#bib.bib10)\] study heterogeneous feedback aggregation under personalization with monetary incentives\. Dubey et al\. \[[19](https://arxiv.org/html/2605.24052#bib.bib19)\] further develop auction mechanisms with LLM\-generated summaries, and Xu et al\. \[[20](https://arxiv.org/html/2605.24052#bib.bib20)\] design auction mechanisms for real\-time physical\-virtual synchronization in human\-centric metaverse settings\. However, such payment\-based mechanisms largely focus on one\-shot or offline preference elicitation and do not consider online interactions over time, where workers have more room to strategically misreport across iterations and shape long\-term outcomes\. In contrast, our work focuses on non\-monetary mechanism design under repeated interactions, where the platform incentivizes truthful feedback through dynamic weight adjustment rather than monetary transfers, which is more practical in mobile crowdsourcing where per\-query monetary settlement to mobile users is costly to implement\.

Non\-monetary mechanism design in algorithmic game theory\. In the algorithmic game theory literature, the median scheme on a one\-dimensional space is known to be group\-strategyproof when agents have single\-peaked preferences, dating back to the classical median voter result of Moulin \[[29](https://arxiv.org/html/2605.24052#bib.bib29)\]\. This foundational result has motivated a substantial body of work on truthful mechanisms without money, particularly in facility location games \(e\.g\., \[[24](https://arxiv.org/html/2605.24052#bib.bib24),[25](https://arxiv.org/html/2605.24052#bib.bib25),[26](https://arxiv.org/html/2605.24052#bib.bib26)\]\), where each customer can strategically misreport his location to bias the placement toward his own preference, and the median scheme is widely used to elicit truthful reports \(e\.g\., \[[11](https://arxiv.org/html/2605.24052#bib.bib11),[27](https://arxiv.org/html/2605.24052#bib.bib27)\]\)\. Recent extensions consider obnoxious facility location \[[26](https://arxiv.org/html/2605.24052#bib.bib26)\] and group\-fair variants under intra\-group externalities \[[27](https://arxiv.org/html/2605.24052#bib.bib27)\]\. A parallel line of work on peer prediction \(e\.g\., \[[30](https://arxiv.org/html/2605.24052#bib.bib30),[31](https://arxiv.org/html/2605.24052#bib.bib31)\]\) and recent extensions to online and information\-elicitation\-without\-verification settings \(e\.g\., \[[32](https://arxiv.org/html/2605.24052#bib.bib32),[33](https://arxiv.org/html/2605.24052#bib.bib33)\]\) further studies non\-monetary truthful elicitation when ground truth is unknown or delayed\. However, these mechanisms are designed for static or one\-shot decisions and do not address the online learning aspect with vanishing\-regret guarantees\. Moreover, we prove later in Section [IV\-C](https://arxiv.org/html/2605.24052#S4.SS3) that the median scheme incurs a non\-vanishing regret in our online mobile crowdsourcing setting, because it fails to give full credit to the most accurate worker even when all workers report truthfully\.
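As a small illustration of why the one\-dimensional median discourages misreporting, the sketch below uses three hypothetical agents on the interval [0, 1]. The locations and the function name `median_aggregate` are assumptions made for this example only; they do not come from the paper's experiments.

```python
from statistics import median

def median_aggregate(reports):
    """Return the median of the reported one-dimensional locations."""
    return median(reports)

# Hypothetical true locations of three agents on the line [0, 1].
true_locations = [0.2, 0.5, 0.9]
honest_outcome = median_aggregate(true_locations)

# Agent 0 (true location 0.2) tries two deviations.
# Reporting further to the left leaves the median where it was.
understate = median_aggregate([0.0, 0.5, 0.9])
# Reporting past the median moves the outcome away from agent 0.
overshoot = median_aggregate([0.6, 0.5, 0.9])

print(honest_outcome, understate, overshoot)
```

In this toy instance neither deviation brings the outcome closer to agent 0's true location, which is the intuition behind the truthfulness of the median. The sketch says nothing about regret, which is the separate issue raised in the paragraph above.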

Adaptive weighting schemes in crowdsourcing and online learning\. In the crowdsourcing and online learning literature, several adaptive weighting schemes have been proposed to dynamically aggregate feedback from multiple workers\. The Dawid\-Skene model \[[34](https://arxiv.org/html/2605.24052#bib.bib34)\] is the foundational truth\-inference framework, and its EM\-based extensions \(e\.g\., \[[15](https://arxiv.org/html/2605.24052#bib.bib15),[16](https://arxiv.org/html/2605.24052#bib.bib16)\]\) treat the true outcome as a hidden variable and iteratively estimate worker reliability via Expectation\-Maximization\. Hedge\-style online learning \(e\.g\., \[[17](https://arxiv.org/html/2605.24052#bib.bib17),[18](https://arxiv.org/html/2605.24052#bib.bib18)\]\) updates worker weights via exponential decay on observed losses and provides $\mathcal{O}(\sqrt{T})$ regret under truthful reporting\. EXP3 and its variants \(e\.g\., \[[35](https://arxiv.org/html/2605.24052#bib.bib35),[36](https://arxiv.org/html/2605.24052#bib.bib36)\]\) further extend these ideas to the partial\-feedback bandit setting using inverse propensity scoring\. More recent work studies strategic crowdsourcing where workers may misreport to game the aggregation, including truthful peer grading \[[37](https://arxiv.org/html/2605.24052#bib.bib37)\], no\-regret incentive\-compatible online learning \[[38](https://arxiv.org/html/2605.24052#bib.bib38)\], and incentive\-aware federated bandits \[[39](https://arxiv.org/html/2605.24052#bib.bib39)\]\. However, existing strategic\-aware schemes either rely on monetary payments, focus on offline or one\-shot elicitation, or do not provide vanishing\-regret guarantees under online multi\-worker interactions with verifiable ground truth\. We later prove in Sections [IV](https://arxiv.org/html/2605.24052#S4) and [VI\-C](https://arxiv.org/html/2605.24052#S6.SS3) that none of the adaptive schemes above guarantees both truthfulness and vanishing regret under selfish workers in our setting, motivating our new mechanism design\.
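For readers unfamiliar with Hedge, the following minimal sketch shows the standard exponential\-weights update referred to above. The learning rate `eta`, the three workers, and the loss values are hypothetical choices for illustration; this is the textbook Hedge step, not the mechanism proposed in this paper.

```python
import math

def hedge_update(weights, losses, eta):
    """One Hedge step: multiply each weight by exp(-eta * loss)."""
    return [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]

def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical per-round losses in [0, 1] for three workers.
weights = [1.0, 1.0, 1.0]
loss_rounds = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8], [0.0, 0.6, 1.0]]
for losses in loss_rounds:
    weights = hedge_update(weights, losses, eta=0.5)

shares = normalize(weights)
print(shares)
```

Workers with smaller cumulative loss end up with larger shares. The update takes the observed losses at face value, which is why it presumes truthful reporting.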

![Refer to caption](https://arxiv.org/html/2605.24052v1/fig.png)

Figure 1: System model of LLM fine\-tuning for mobile crowdsourcing\. During each time slot $t \in [T]$, in Stage I, each mobile worker \(e\.g\., mobile user\) first reports his preference on the pairwise responses of each prompt \(e\.g\., a traffic\-state query in navigation or a band\-availability query in spectrum sensing\)\. In Stage II, the crowdsourcing platform \(e\.g\., Waze\) aggregates workers’ feedback for fine\-tuning the LLM and updates its policy\. In Stage III, the platform adjusts each worker $i$’s weight $w_i^{t+1}$ according to his feedback accuracy for the next time slot $t+1$’s iteration\.

## III System Model and Problem Formulation

In Section [III\-A](https://arxiv.org/html/2605.24052#S3.SS1), we introduce our system model\. In Section [III\-B](https://arxiv.org/html/2605.24052#S3.SS2), we formulate a new dynamic Bayesian game and give desired properties for guiding our later mechanism design\.

### III\-A System Model of Online Mobile Crowdsourcing Based on LLM Fine\-Tuning

LLM fine\-tuning is increasingly used to enhance mobile applications by producing more accurate and verifiable outputs \(e\.g\.,\[[40](https://arxiv.org/html/2605.24052#bib.bib40)\]\)\. For example, navigation applications require route explanations that correctly reflect real\-time traffic conditions \(e\.g\.,\[[41](https://arxiv.org/html/2605.24052#bib.bib41)\]\), while mobile spectrum\-sensing systems rely on channel\-usage interpretations that must align with the true occupancy state of the spectrum \(e\.g\.,\[[42](https://arxiv.org/html/2605.24052#bib.bib42)\]\)\. However, these mobile applications operate in highly dynamic environments, where traffic patterns and routing contexts can change minute by minute, and spectrum conditions and interference levels vary rapidly due to user mobility and network fluctuations\. As a result, user preferences and feedback distributions evolve over time, causing offline fine\-tuning to become quickly outdated\. This dynamic nature necessitates online or iterative LLM fine\-tuning, where the platform continuously incorporates newly collected mobile feedback to maintain reliable system performance under varying real\-world conditions\.

Based on the above facts, we consider LLM fine\-tuning for mobile crowdsourcing in which the crowdsourcing platform \(e\.g\., Waze\) receives preference feedback from $N \geq 2$ mobile workers \(e\.g\., mobile users\) over $T$ time slots \(e\.g\., a weekly update cycle \[[43](https://arxiv.org/html/2605.24052#bib.bib43),[44](https://arxiv.org/html/2605.24052#bib.bib44),[23](https://arxiv.org/html/2605.24052#bib.bib23)\]\)\. Each time slot $t \in [T]$ consists of the following three stages \(as in Fig\. [1](https://arxiv.org/html/2605.24052#S2.F1)\)\.

1\) Stage I\. Online Worker Feedback: The platform draws $m_t$ prompts $\{x_j^t\}_{j=1}^{m_t}$ from the context space $\mathcal{X}$ and a pair of candidate responses $\{(y_{l_j}^t, y_{l_j'}^t \mid x_j^t)\}_{j=1}^{m_t}$ for each prompt from the response space $\mathcal{Y}$ \(e\.g\., \[[23](https://arxiv.org/html/2605.24052#bib.bib23)\]\)\. Each prompt corresponds to a task with a ground\-truth system state\. For navigation service, a prompt can be “Does live traffic data indicate congestion on Route A between 5:20 PM and 5:40 PM?”, where $y_{l_j}^t$ and $y_{l_j'}^t$ represent two alternative route explanations \(e\.g\., one stating that congestion is present and another stating that conditions are clear\)\. For spectrum sensing, a prompt can be “Do the measured signals indicate that the 3\.5 GHz channel at \(40\.7, \-74\.0\) is idle at time $t$?”, where $y_{l_j}^t$ and $y_{l_j'}^t$ represent two candidate channel\-occupancy interpretations \(e\.g\., one claiming the channel is idle and another claiming it is busy\)\. Such prompts have a correct answer determined by real\-world conditions, so each pair of responses has a binary ground\-truth system state\.

The platform then shares $\{(x_j^t, y_{l_j}^t, y_{l_j'}^t)\}_{j=1}^{m_t}$ with $N$ workers and collects their preference feedback\. Each worker $i \in [N]$ forms a continuous private preference $\mathcal{P}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t) \in [0,1]$ based on his own local observations \(e\.g\., \[[9](https://arxiv.org/html/2605.24052#bib.bib9),[45](https://arxiv.org/html/2605.24052#bib.bib45)\]\)\. In navigation tasks, this belief may come from the worker’s real\-time mobile traffic view or recent travel experience\. In spectrum\-sensing tasks, it may come from locally measured signal strength or device\-level sensing results\. Worker $i$ holds the belief of the ground\-truth system state as $p_j^t \sim \mathrm{Bernoulli}(\mathcal{P}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t))$, where realization $p_j^t = 1$ means response $y_{l_j}^t$ is preferred over $y_{l_j'}^t$ and $p_j^t = 0$ otherwise\. Note that such location\-specific sensing context is distinctive to mobile crowdsourcing, where worker heterogeneity stems from spatial and temporal locality that the platform cannot observe directly\. This motivates our Bayesian game formulation in Section [III\-B](https://arxiv.org/html/2605.24052#S3.SS2)\.

Aiming to increase his long\-term influence or payment, each worker $i$ may report a continuous value $\hat{\mathcal{P}}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t) \in [0,1]$ that differs from his true preference $\mathcal{P}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t)$ \(e\.g\., \[[8](https://arxiv.org/html/2605.24052#bib.bib8)\], \[[9](https://arxiv.org/html/2605.24052#bib.bib9)\]\)\. The platform and other workers are uncertain of the realization of his $\mathcal{P}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t)$\.

2\) Stage II\. Online Feedback Aggregation and Policy Update: After receiving each worker $i$’s reported preference values $\{\hat{\mathcal{P}}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t)\}_{j=1}^{m_t}$ for $i \in [N]$, the platform aggregates them using the current weight $w_i^t$ for each prompt $j \in [m_t]$:

$$\mathcal{P}(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t) = \frac{\sum_{i=1}^{N} w_i^t \hat{\mathcal{P}}_i(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t)}{\sum_{i'=1}^{N} w_{i'}^t}. \tag{1}$$

All workers begin with a uniform weight $w_i^1 = 1$, which corresponds to the standard LLM fine\-tuning practice of treating early feedback equally \(e\.g\., \[[45](https://arxiv.org/html/2605.24052#bib.bib45),[46](https://arxiv.org/html/2605.24052#bib.bib46),[6](https://arxiv.org/html/2605.24052#bib.bib6)\]\)\.
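The weighted aggregation in (1) can be sketched in a few lines. The function name `aggregate_preferences` and the example reports are hypothetical; the code only assumes that each worker supplies one value in [0, 1] per prompt and that at least one weight is positive.

```python
def aggregate_preferences(weights, reports):
    """Weighted average of reported preferences, per prompt, as in Eq. (1).

    weights: list of N nonnegative worker weights w_i^t.
    reports: N lists, each holding m_t reported values in [0, 1].
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one worker weight must be positive")
    num_prompts = len(reports[0])
    return [
        sum(w * r[j] for w, r in zip(weights, reports)) / total
        for j in range(num_prompts)
    ]

# Hypothetical reports from three workers on two prompts.
reports = [[0.9, 0.2], [0.6, 0.4], [0.1, 0.8]]
uniform = aggregate_preferences([1.0, 1.0, 1.0], reports)  # initial weights
skewed = aggregate_preferences([2.0, 1.0, 1.0], reports)   # worker 0 up-weighted
print(uniform, skewed)
```

With uniform weights the aggregate is the plain average; raising one worker's weight pulls every aggregated preference toward that worker's reports, which is what makes the weight a natural measure of influence.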

The resulting aggregated preferences $\{\mathcal{P}(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t)\}_{j=1}^{m_t}$ directly form the human\-feedback dataset $\mathcal{D}_t = \{(x_j^t, y_{l_j}^t, y_{l_j'}^t, \mathcal{P}(y_{l_j}^t \succ y_{l_j'}^t \mid x_j^t))\}_{j=1}^{m_t}$ for time slot $t$\. Using this dataset, the platform updates the LLM policy via direct preference optimization \(DPO\), which solves the KL\-regularized objective against a reference policy $\pi_{\texttt{ref}}$ as follows \[[23](https://arxiv.org/html/2605.24052#bib.bib23),[45](https://arxiv.org/html/2605.24052#bib.bib45)\]:

$$\min_{\pi_t} \; -\mathbb{E}_{\mathcal{D}_t} \ln \sigma\bigg(\beta \ln \frac{\pi_t(y \mid x)}{\pi_{\texttt{ref}}(y \mid x)} - \beta \ln \frac{\pi_t(y' \mid x)}{\pi_{\texttt{ref}}(y' \mid x)}\bigg), \tag{2}$$

where $\sigma(\cdot)$ denotes the logistic function and $\beta$ is a parameter controlling the deviation from $\pi_{\texttt{ref}}$\. DPO is a reward\-model\-free method that optimizes the policy directly from preference data without learning an explicit reward model\. Our aggregation mechanism therefore influences the policy optimization step by determining the preference labels in $\mathcal{D}_t$, while leaving the underlying DPO update procedure unchanged\. To reduce the platform\-side fine\-tuning cost, we adopt parameter\-efficient LoRA fine\-tuning \(e\.g\., \[[47](https://arxiv.org/html/2605.24052#bib.bib47)\]\), which keeps the base LLM weights frozen and only updates low\-rank adapter matrices inserted into the attention projection layers\.
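To make the objective in (2) concrete, the sketch below evaluates the per\-pair loss from sequence log\-probabilities. It assumes the log\-probabilities of both responses under the current and reference policies are already available, and it treats $y$ as the response taken to be preferred; how the aggregated preference in $\mathcal{D}_t$ selects that ordering is not specified in this section, so the function `dpo_pair_loss` and its inputs are illustrative only.

```python
import math

def dpo_pair_loss(logp_y, logp_y_alt, ref_logp_y, ref_logp_y_alt, beta):
    """Per-pair term of Eq. (2): -ln(sigmoid(margin)).

    margin = beta * [ln(pi_t(y|x)/pi_ref(y|x)) - ln(pi_t(y'|x)/pi_ref(y'|x))]
    """
    margin = beta * ((logp_y - ref_logp_y) - (logp_y_alt - ref_logp_y_alt))
    # -ln(sigmoid(m)) equals ln(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy equals the reference ...
at_reference = dpo_pair_loss(-1.0, -2.0, -1.0, -2.0, beta=0.1)
# ... versus a policy that has shifted mass toward the preferred response.
after_shift = dpo_pair_loss(-0.5, -3.0, -1.0, -2.0, beta=0.1)
print(at_reference, after_shift)
```

When the policy coincides with the reference, the margin is zero and the loss equals $\ln 2$; shifting probability toward the preferred response lowers it. In practice this term would be averaged over a batch and differentiated by an autodiff framework, which the sketch leaves out.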

3\) Stage III\. Reweighing Workers: After deploying the updated policy, the platform verifies the ground\-truth system state $p_j^t \in \{0,1\}$ for each prompt $j \in [m_t]$\. \(Footnote 1: Verification becomes available only at the end of each update cycle, so the delay does not block the online learning loop\. Our mechanism also degrades gracefully under bounded noisy verification of the ground\-truth system state, as formalized by Proposition [3](https://arxiv.org/html/2605.24052#Thmproposition3) in Section [V](https://arxiv.org/html/2605.24052#S5)\.\) In navigation applications, the platform verifies traffic conditions using authoritative infrastructure\-side data that becomes available only after a delay, such as the California DOT’s Performance Measurement System \(PeMS\) \[[48](https://arxiv.org/html/2605.24052#bib.bib48)\], which collects flow and occupancy measurements from nearly $40{,}000$ physical induction loop detectors embedded in the pavement\. PeMS is operated by the state DOT, and its data are released to third\-party platforms only through delayed feeds, so they serve only as a post\-hoc verification signal for worker reweighing rather than as a real\-time substitute for worker reports\. In spectrum\-sensing applications, the fusion center likewise cannot observe primary\-user activity in real time but can verify it afterwards by decoding ACK/NACK packets \[[49](https://arxiv.org/html/2605.24052#bib.bib49),[50](https://arxiv.org/html/2605.24052#bib.bib50),[51](https://arxiv.org/html/2605.24052#bib.bib51)\] or accessing post\-transmission spectrum access system databases \[[52](https://arxiv.org/html/2605.24052#bib.bib52),[53](https://arxiv.org/html/2605.24052#bib.bib53),[54](https://arxiv.org/html/2605.24052#bib.bib54)\]\. Note that such infrastructure\-side verification is distinctive to mobile crowdsourcing and unavailable in generic crowdsourcing tasks \(e\.g\., MTurk\)\. Since it arrives only after a delay, it serves as a post\-hoc signal for worker reweighing in \([6](https://arxiv.org/html/2605.24052#S5.E6)\) rather than a real\-time substitute for worker reports\.

The platform then updates each worker $i$'s weight or payment $w_i^{t+1}$ for the next time slot:

$$w_i^{t+1}=f_i\Big(\big\{\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}\big\}_{i=1}^{N},\{p_j^{t}\}_{j=1}^{m_t}\Big) \tag{3}$$

according to the reported feedback $\{\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}\}_{i=1}^{N}$ and the ground-truth system states $\{p_j^{t}\}_{j=1}^{m_t}$.

By strategically manipulating his reported preference $\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}\mid x_j^{t})$ (e.g., [[8](https://arxiv.org/html/2605.24052#bib.bib8)], [[9](https://arxiv.org/html/2605.24052#bib.bib9)]), each worker $i$ aims to maximize his long-term influence or payment from the platform, measured as his expected cumulative weight over $T$ time slots (e.g., [[55](https://arxiv.org/html/2605.24052#bib.bib55)], [[56](https://arxiv.org/html/2605.24052#bib.bib56)]):

$$u_i\Big(\big\{\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}\big\}_{t=1}^{T}\Big):=\mathbb{E}\Bigg[\sum_{t=1}^{T}w_i^{t}\bigg(\big\{\{\hat{\mathcal{P}}_i(y_{l_j}^{t-1}\succ y_{l_j'}^{t-1}|x_j^{t-1})\}_{j=1}^{m_{t-1}}\big\}_{i=1}^{N},\{p_j^{t-1}\}_{j=1}^{m_{t-1}}\bigg)\Bigg]. \tag{4}$$

Here, the expectation is taken over the ground-truth system states $p_j^t$. We adopt the natural utility of expected cumulative weight.[2]

Footnote 2: Our mechanisms remain truthful when ([4](https://arxiv.org/html/2605.24052#S3.E4)) uses any increasing $g_i(\cdot)$ of expected cumulative weight. Extending to path- or distribution-dependent utilities (e.g., $\mathbb{E}[\sum_t g_i(w_i^t)]$) is, however, inherently difficult because workers' reports affect both current outcomes and the next weight $w_i^{t+1}$, thereby changing future selection probabilities and the distribution of the entire weight trajectory. Truthfulness would then require ruling out deviations that profit by reshaping this trajectory/distribution, which is challenging in sequential stochastic mechanisms.

On the other hand, the platform’s alignment loss based on its aggregation overTTtime slots is given as follows:

$$L=\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\bigg(\sum_{i=1}^{N}\frac{w_i^{t}\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})}{\sum_{i'=1}^{N}w_{i'}^{t}}-p_j^{t}\bigg)^{2},$$

which is defined as the mean square error (MSE) between the platform's weighted aggregation in ([1](https://arxiv.org/html/2605.24052#S3.E1)) and the realized binary outcome. The platform wants to improve the feedback accuracy of the aggregation by assigning the largest weight to the most accurate worker at time $t$, and this assignment changes over time. As each worker gives feedback over $T$ time slots, the ideal choice for the platform is to commit to the worker $i^*$ incurring the least average feedback loss over $T$ time slots:

$$i^{*}=\arg\min_{i\in[N]}\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}.$$

However, the best worker is unknown in the online iteration. The platform therefore aims to reduce the feedback accuracy regret between the online weighted aggregation and the offline choice of the best worker in hindsight:

$$\begin{aligned}
R(T):={}&\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\bigg(\sum_{i=1}^{N}\frac{w_i^{t}\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})}{\sum_{i'=1}^{N}w_{i'}^{t}}-p_j^{t}\bigg)^{2}\\
&-\min_{i\in[N]}\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}.
\end{aligned}\tag{5}$$
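The regret in (5) can be evaluated directly once a weight trajectory, the reports, and the verified outcomes are logged. The sketch below assumes truthful reports (so the same array serves as both $\hat{\mathcal{P}}_i$ and $\mathcal{P}_i$) and uses a hypothetical nested-list layout: `weights[t][i]`, `reports[t][i][j]`, and `outcomes[t][j]`.

```python
def regret(weights, reports, outcomes):
    """Regret (5): platform's aggregation loss minus the best worker's loss.

    Assumes truthful reports, so reports[t][i][j] plays the role of both
    the reported and the private preference of worker i on prompt j.
    """
    num_workers = len(weights[0])
    platform_loss = 0.0
    worker_loss = [0.0] * num_workers
    for t, slot_outcomes in enumerate(outcomes):
        m_t = len(slot_outcomes)
        total_weight = sum(weights[t])
        for j, p in enumerate(slot_outcomes):
            # Weighted aggregation of all workers' reports on prompt j.
            agg = sum(weights[t][i] * reports[t][i][j]
                      for i in range(num_workers)) / total_weight
            platform_loss += (agg - p) ** 2 / m_t
            for i in range(num_workers):
                worker_loss[i] += (reports[t][i][j] - p) ** 2 / m_t
    return platform_loss - min(worker_loss)
```

For instance, with one slot, one prompt, equal weights, reports 1.0 and 0.0, and outcome 1, the aggregation is 0.5, the platform loss is 0.25, and the best worker's loss is 0, so the regret is 0.25.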
Although our LLM fine-tuning for mobile crowdsourcing and federated learning (FL) both involve distributed edge participants, they address fundamentally different problems and therefore require different mechanism-design objectives. FL uses all participants' data to train a global model and treats all data as authentic and useful (e.g., [[39](https://arxiv.org/html/2605.24052#bib.bib39),[57](https://arxiv.org/html/2605.24052#bib.bib57),[58](https://arxiv.org/html/2605.24052#bib.bib58)]). In contrast, our objective is to identify the worker with the most accurate feedback over time, assign him the largest weight, and discard inaccurate workers by letting their weights vanish.

Note that the worker's utility in ([4](https://arxiv.org/html/2605.24052#S3.E4)) may not align with the platform's objective in ([5](https://arxiv.org/html/2605.24052#S3.E5)), leading to untruthful feedback in pursuit of a large weight (e.g., [[38](https://arxiv.org/html/2605.24052#bib.bib38)]). For example, suppose that the number of prompts is $m_t=1$ and the number of time slots is $T=1$. The platform updates $w_i^{t+1}=1$ if $|\hat{\mathcal{P}}_i^{t}-p_j^{t}|\leq 0.2$, $w_i^{t+1}=0.5$ if $|\hat{\mathcal{P}}_i^{t}-p_j^{t}|\in(0.2,0.5]$, and $w_i^{t+1}=0$ otherwise. A worker $i$ holding $\mathcal{P}_i^{t}=0.6$ obtains an expected weight of $0.6\times 0.5+0.4\times 0=0.3$ by truthfully reporting. However, he can obtain an expected weight of $0.6\times 1+0.4\times 0=0.6$ by misreporting any $\hat{\mathcal{P}}_i^{t}\geq 0.8$, which is higher than under honest reporting. Therefore, it is crucial for the platform to design the weight update function properly, so that every worker reports truthfully and the regret stays small.
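The arithmetic of this counterexample can be checked with a few lines of Python. The helper names `tiered_weight` and `expected_weight` are hypothetical; the thresholds are the ones stated in the example, and the expectation is taken under the worker's own Bernoulli belief about the outcome.

```python
def tiered_weight(report, outcome):
    """Weight rule of the counterexample: 1, 0.5, or 0 by distance to the outcome."""
    gap = abs(report - outcome)
    if gap <= 0.2:
        return 1.0
    if gap <= 0.5:
        return 0.5
    return 0.0

def expected_weight(belief, report):
    """Expected next weight when the worker believes the outcome is 1 w.p. `belief`."""
    return (belief * tiered_weight(report, 1)
            + (1.0 - belief) * tiered_weight(report, 0))
```

With belief 0.6, `expected_weight(0.6, 0.6)` is approximately 0.3, whereas an exaggerated report such as `expected_weight(0.6, 0.9)` is approximately 0.6, so this step-shaped rule rewards misreporting.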

Remark (Communication and computation overhead). LLM fine-tuning is performed centrally at the platform; workers only upload scalar preference reports $\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}$ per slot. In typical mobile crowdsourcing settings, $m_t$ and $N$ are on the order of tens to a hundred. For $N=50$ and $m_t=20$, each worker's per-slot uplink is $4m_t=80$ bytes, totaling $\sim 4$ KB at the platform; no model parameters, adapters, or gradients are transmitted. The per-slot aggregation in ([1](https://arxiv.org/html/2605.24052#S3.E1)) and weight update in ([6](https://arxiv.org/html/2605.24052#S5.E6)) introduced by our mechanism cost $(2N+1)m_t+2Nm_t+N\approx 4000$ floating-point operations. The dominant cost is the platform-side LoRA fine-tuning, which is independent of our mechanism and trains only a small fraction of the full LLM (e.g., $0.24\%$ for GPT-2 124M), keeping GPT-2 fine-tuning tractable on a single NVIDIA A100 GPU. Workers perform no on-device LLM computation.
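The figures in this remark follow from simple counting, reproduced in the sketch below. The function name `per_slot_overhead` is hypothetical, and the 4 bytes per scalar report is an assumption read off the expression $4m_t$ (for example, a 32-bit float per report).

```python
def per_slot_overhead(num_workers, num_prompts, bytes_per_report=4):
    """Return (uplink bytes per worker, total uplink bytes, mechanism FLOPs) per slot."""
    uplink_per_worker = bytes_per_report * num_prompts
    total_uplink = num_workers * uplink_per_worker
    # FLOP count as stated in the remark: (2N+1)m_t + 2N m_t + N.
    flops = ((2 * num_workers + 1) * num_prompts
             + 2 * num_workers * num_prompts
             + num_workers)
    return uplink_per_worker, total_uplink, flops
```

For $N=50$ and $m_t=20$ this gives 80 bytes per worker, 4000 bytes in total, and 4070 floating-point operations, matching the rounded values above.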

### III-B Dynamic Bayesian Game Formulation for RLHF

Based on our system model above, we formulate the multi-agent online learning between the crowdsourcing platform and $N$ strategic workers as a new dynamic Bayesian game:

- In Stage I of each time slot $t\in[T]$, each worker $i$ with his private preference $\{\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}$ determines his feedback $\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}$ (which may not be truthful) to maximize his utility in ([4](https://arxiv.org/html/2605.24052#S3.E4)).
- In Stage III of each time slot $t\in[T]$, the platform updates each worker's weight $w_i^{t+1}=f_i\big(\{\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{j=1}^{m_t}\}_{i=1}^{N},\{p_j^{t}\}_{j=1}^{m_t}\big)$ to reduce the regret in ([5](https://arxiv.org/html/2605.24052#S3.E5)).

Note that there is no strategic decision for any worker or the platform in Stage II\. We need to carefully design an online aggregation mechanism for ensuring each worker’s truthful feedback and a vanishing regret over time\. We define the desired properties as below\.

###### Definition 1 (Truthfulness of Worker Feedback)

An online weighted aggregation mechanism $\mathcal{M}$ is truthful if, for each worker $i\in[N]$, the long-term influence or payment in ([4](https://arxiv.org/html/2605.24052#S3.E4)) over the whole $T$ time slots under truthful feedback is no smaller than that under any misreporting, i.e.,

$$\begin{aligned}
&\mathbb{E}\Bigg[\sum_{t=1}^{T}w_i^{t}\bigg(\{\mathcal{P}_i(y_{l_j}^{t-1}\succ y_{l_j'}^{t-1}|x_j^{t-1})\}_{j=1}^{m_{t-1}},\{p_j^{t-1}\}_{j=1}^{m_{t-1}},\big\{\{\hat{\mathcal{P}}_k(y_{l_j}^{t-1}\succ y_{l_j'}^{t-1}|x_j^{t-1})\}_{j=1}^{m_{t-1}}\big\}_{k=1,k\neq i}^{N}\bigg)\Bigg]\\
&\geq\mathbb{E}\Bigg[\sum_{t=1}^{T}w_i^{t}\bigg(\{\hat{\mathcal{P}}_i(y_{l_j}^{t-1}\succ y_{l_j'}^{t-1}|x_j^{t-1})\}_{j=1}^{m_{t-1}},\{p_j^{t-1}\}_{j=1}^{m_{t-1}},\big\{\{\hat{\mathcal{P}}_k(y_{l_j}^{t-1}\succ y_{l_j'}^{t-1}|x_j^{t-1})\}_{j=1}^{m_{t-1}}\big\}_{k=1,k\neq i}^{N}\bigg)\Bigg].
\end{aligned}$$

###### Definition 2 (High Efficiency in Sublinear Regret $R(T)$ in ([5](https://arxiv.org/html/2605.24052#S3.E5)))

An online weighted aggregation mechanism $\mathcal{M}$ is efficient if its time-average regret $R_{\mathcal{M}}(T)/T$ vanishes as the number of time slots $T$ grows, i.e., $\lim_{T\to\infty}\frac{R_{\mathcal{M}}(T)}{T}=0.$

## IV Benchmark Schemes

In this section, we analyze three recent adaptive weighting schemes from the crowdsourcing and algorithmic game theory literature, serving as fair benchmarks to compare against later\.

### IV-A Benchmark 1: EM-based Weight Estimation Scheme

In the crowdsourcing literature, EM-based weight estimation (e.g., [[15](https://arxiv.org/html/2605.24052#bib.bib15),[16](https://arxiv.org/html/2605.24052#bib.bib16)]) treats the true outcome $p_j^t$ as a hidden binary variable and iteratively estimates both the worker weights $w_i^t$ and the most likely outcome via Expectation-Maximization. We consider EM instantiations commonly used in crowdsourcing truth inference (e.g., Gaussian or Dawid–Skene–type models), where the E-step yields a monotone, majority-consistent estimate of the latent variable and the M-step rewards proximity to that estimate. Unfortunately, this scheme is not truthful and incurs a non-vanishing regret.

###### Lemma 1

Benchmark 1, the EM-based weight estimation scheme, is not truthful and incurs a regret in ([5](https://arxiv.org/html/2605.24052#S3.E5)) of $R_1(T)=\mathcal{O}(T)$, leading to a non-vanishing time-average regret $\lim_{T\to\infty}\frac{R_1(T)}{T}>0$.

The proof is given in Appendix A of the supplementary material\. Since EM relies on statistical consistency of worker reports, strategic workers can inflate their apparent reliability by aligning with the majority, earning disproportionate weights without accurate feedback\.

### IV-B Benchmark 2: Hedge Scheme

The Hedge scheme (e.g., [[17](https://arxiv.org/html/2605.24052#bib.bib17),[18](https://arxiv.org/html/2605.24052#bib.bib18)]) updates each worker $i$'s weight by exponential decay on the squared feedback loss:

$$w_i^{t+1}=w_i^{t}\cdot e^{-\eta\cdot\frac{1}{m_t}\sum_{j=1}^{m_t}(\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}},$$

for $i\in[N]$ and $t\in[T]$, where $\eta$ is the learning rate. Unfortunately, Hedge is not truthful, as shown below.

###### Lemma 2

Benchmark 2 of Hedge scheme is not truthful\.

The proof is given in Appendix B of the supplementary material\. As a worker with a large private preference can earn a higher weight by exaggerating his belief rather than reporting truthfully, Hedge fails to guarantee his truthful reporting\.
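The incentive to exaggerate under Hedge can be illustrated numerically for a single prompt. The sketch below is not the proof in Appendix B; it assumes one prompt per slot, a worker whose belief that the outcome equals 1 is `belief`, and a learning rate `eta=1.0` chosen only for illustration. The names `hedge_expected_multiplier` and `best_report` are hypothetical.

```python
import math

def hedge_expected_multiplier(belief, report, eta=1.0):
    """Expected Hedge factor E[exp(-eta * (report - p)^2)] under the worker's belief."""
    return (belief * math.exp(-eta * (report - 1.0) ** 2)
            + (1.0 - belief) * math.exp(-eta * report ** 2))

def best_report(belief, eta=1.0, grid=1000):
    """Grid search over reports in [0, 1] for the one maximizing the expected factor."""
    candidates = [k / grid for k in range(grid + 1)]
    return max(candidates, key=lambda r: hedge_expected_multiplier(belief, r, eta))
```

For a belief of 0.8, the maximizing report found by the grid search lies strictly above 0.8, so a worker gains expected weight by overstating his belief, which is the behavior described above.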

### IV-C Benchmark 3: Median Aggregation Scheme

In the algorithmic game theory literature, the “median” scheme is widely used to incentivize truthful reporting from selfish agents \(e\.g\.,\[[11](https://arxiv.org/html/2605.24052#bib.bib11)\],\[[27](https://arxiv.org/html/2605.24052#bib.bib27)\]\)\.

###### Definition 3 (Median Aggregation Scheme)

The platform first sorts workers' feedback $\{\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})\}_{i=1}^{N}$ in increasing order as $\hat{\mathcal{P}}_{k_1,j}^{t}\leq\cdots\leq\hat{\mathcal{P}}_{k_N,j}^{t}$ for each prompt $j\in[m_t]$ in each time slot $t\in[T]$. It then chooses the median $\hat{\mathcal{P}}_{k_s,j}^{t}$ as its aggregation, where the index $s=\frac{N}{2}$ if $N$ is even and $s=\frac{N+1}{2}$ otherwise.
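Definition 3 translates into a short function for a single prompt. This is a minimal sketch; the name `median_aggregate` is hypothetical, and the only subtlety is converting the 1-indexed position $s$ into a 0-indexed list position.

```python
def median_aggregate(reports):
    """Median aggregation of one prompt's reports, following Definition 3."""
    ordered = sorted(reports)
    n = len(ordered)
    # 1-indexed position s: N/2 for even N, (N+1)/2 for odd N.
    s = n // 2 if n % 2 == 0 else (n + 1) // 2
    return ordered[s - 1]
```

For an even number of workers this selects the lower of the two middle reports, as the definition specifies, rather than their average.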

Yet, this scheme still incurs non\-vanishing regret\.

###### Lemma 3

The platform's regret in ([5](https://arxiv.org/html/2605.24052#S3.E5)) under benchmark 3, the median scheme, is $R_3(T)=\mathcal{O}(T)$, leading to a non-vanishing time-average regret $\lim_{T\to\infty}\frac{R_3(T)}{T}>0$.

The proof is given in Appendix C of the supplementary material. Even under truthful reporting, the median fails to fully weight the most accurate worker $o$ with $\mathcal{P}_o(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})=p_j^{t}$, yielding an $\mathcal{O}(T)$ aggregation loss while the best worker in hindsight incurs zero loss. The untruthfulness and non-vanishing regrets of these benchmarks motivate our truthful mechanism below.

## V Truthful Online Weighted Aggregation Mechanism Design and Analysis

As benchmarks 1-3 with untruthful worker feedback fail to identify the most accurate worker over time, we are motivated to incentivize each worker's truthfulness and to dynamically adjust each worker's weight according to his feedback accuracy in each time slot. In Stage III of each time slot, we assign a worker a larger weight (relative to the others) if his prior feedback is closer to the realized binary outcome. We need to carefully design the weight update in ([3](https://arxiv.org/html/2605.24052#S3.E3)) to ensure that each worker obtains the largest long-term reputation in Definition [1](https://arxiv.org/html/2605.24052#Thmdefinition1) only with truthful feedback. We define it as follows.

###### Definition 4 (Online Weighted Aggregation Mechanism)

At Stage III of each time slot $t\in[T]$, the platform updates each worker's weight $w_i^{t+1}$ in ([3](https://arxiv.org/html/2605.24052#S3.E3)) based on his feedback $\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})$ and the realized binary outcome $p_j^{t}$:

$$w_i^{t+1}=w_i^{t}\cdot\bigg(1-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\bigg), \tag{6}$$

where $\alpha>0$ is the step-size parameter to be determined later.[3]

Footnote 3: Although $\alpha$ is uniform across workers, our mechanism remains responsive to new high-quality workers, as formalized by Proposition [2](https://arxiv.org/html/2605.24052#Thmproposition2) in Section [V](https://arxiv.org/html/2605.24052#S5).

Intuitively, our mechanism determines each worker's weight in time slot $t+1$ based on his feedback accuracy in the previous time slot $t$. If the squared difference between his feedback and the realized binary outcome, $(\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}$, is small, his weight $w_i^{t+1}$ is reduced only slightly from $w_i^{t}$. Although all workers' weights decrease over time, only the relative weights matter in the aggregation in ([1](https://arxiv.org/html/2605.24052#S3.E1)), so a worker with a small decrement has a large influence on the platform's aggregation. Our mechanism satisfies the truthful property as shown below.
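A single Stage III step of the update in (6) can be sketched as follows. The function name `update_weights` and the list layout (`reports[i][j]` for worker $i$ and prompt $j$) are illustrative assumptions; the arithmetic is the multiplicative update stated in Definition 4.

```python
def update_weights(weights, reports, outcomes, alpha):
    """One application of (6): w_i <- w_i * (1 - alpha * mean squared error of worker i)."""
    m_t = len(outcomes)
    new_weights = []
    for w, worker_reports in zip(weights, reports):
        # Average squared gap between this worker's reports and the verified outcomes.
        loss = sum((r - p) ** 2 for r, p in zip(worker_reports, outcomes)) / m_t
        new_weights.append(w * (1.0 - alpha * loss))
    return new_weights
```

With two workers of unit weight, one prompt with outcome 1, reports 1.0 and 0.0, and $\alpha=0.1$, the accurate worker keeps weight 1.0 while the inaccurate one drops to 0.9, which shifts the normalized aggregation toward the accurate worker in the next slot.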

###### Proposition 1

Our mechanism in Definition [4](https://arxiv.org/html/2605.24052#Thmdefinition4) is truthful, i.e., $\hat{\mathcal{P}}_i^{*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})=\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})$ for any prompt $j\in[m_t]$, worker $i\in[N]$ and time slot $t\in[T]$.

The proof is given in Appendix D of the supplementary material of this TMC submission. As each worker holds a Bernoulli belief on outcome $p_j^{t}$, any deviation from truthful feedback leads to a strictly lower expected weight in any time slot $t$. Thus, no worker has an incentive to misreport, and truthfulness is guaranteed. Further, our mechanism is efficient and incurs a vanishing time-average regret in $T$.
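The intuition can be checked numerically for one prompt: under a Bernoulli belief $q$ that the outcome equals 1, the expected multiplier in (6) is $1-\alpha\big(q(r-1)^2+(1-q)r^2\big)=1-\alpha\big(q(1-q)+(r-q)^2\big)$, which is maximized at the report $r=q$. The sketch below is a single-slot, single-prompt illustration and not the proof in Appendix D; the function names and the value `alpha=0.1` are assumptions.

```python
def expected_multiplier(belief, report, alpha=0.1):
    """Expected factor 1 - alpha * E[(report - p)^2] under the worker's belief."""
    expected_loss = belief * (report - 1.0) ** 2 + (1.0 - belief) * report ** 2
    return 1.0 - alpha * expected_loss

def best_report(belief, alpha=0.1, grid=1000):
    """Grid search over reports in [0, 1] for the one maximizing the expected factor."""
    candidates = [k / grid for k in range(grid + 1)]
    return max(candidates, key=lambda r: expected_multiplier(belief, r, alpha))
```

On this grid the maximizing report coincides with the belief (for example, a belief of 0.6 yields a best report of 0.6), in contrast with the exponential Hedge update, where the maximizer is pushed away from the belief.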

###### Theorem 1

Our online weighted aggregation mechanism in Definition [4](https://arxiv.org/html/2605.24052#Thmdefinition4) incurs a sublinear regret $R_{\mathcal{M}}(T)=\mathcal{O}(T^{\frac{1}{2}})$ by choosing the step size $\alpha$ in ([6](https://arxiv.org/html/2605.24052#S5.E6)) as

$$\alpha=\frac{2}{3}\sqrt{\frac{2\ln N}{T}},$$

leading to zero time-average regret with $\lim_{T\to\infty}\frac{R_{\mathcal{M}}(T)}{T}=0$.
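The step size in Theorem 1 is the minimizer of the bound $\frac{\ln N}{\alpha}+\frac{9T\alpha}{8}$ that appears in the proof, and plugging it in gives $3\sqrt{T\ln N/2}$. The sketch below checks this numerically; the function names are hypothetical.

```python
import math

def step_size(num_workers, horizon):
    """Step size alpha = (2/3) * sqrt(2 ln N / T) from Theorem 1."""
    return (2.0 / 3.0) * math.sqrt(2.0 * math.log(num_workers) / horizon)

def bound_at(alpha, num_workers, horizon):
    """Regret bound ln(N)/alpha + 9*T*alpha/8 for a generic step size alpha."""
    return math.log(num_workers) / alpha + 9.0 * horizon * alpha / 8.0

def regret_bound(num_workers, horizon):
    """Closed-form bound 3 * sqrt(T ln N / 2) obtained at the optimal step size."""
    return 3.0 * math.sqrt(horizon * math.log(num_workers) / 2.0)
```

For example, with $N=50$ and $T=1000$, `bound_at(step_size(50, 1000), 50, 1000)` agrees with `regret_bound(50, 1000)` up to floating-point error, and the step size is below the $\frac{1}{2}$ required in the proof.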

Proof. According to Proposition [1](https://arxiv.org/html/2605.24052#Thmproposition1), we have $\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})=\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})$ for all $j\in[m_t]$, $i\in[N]$ and $t\in[T]$. To derive a lower bound on $\ln\frac{\sum_{i=1}^{N}w_i^{T+1}}{\sum_{i=1}^{N}w_i^{1}}$, we have

$$\begin{aligned}
\ln\frac{\sum_{i=1}^{N}w_i^{T+1}}{\sum_{i=1}^{N}w_i^{1}}
&=\ln\bigg(\sum_{i=1}^{N}w_i^{T+1}\bigg)-\ln\bigg(\sum_{i=1}^{N}w_i^{1}\bigg)\\
&=\ln\bigg(\sum_{i=1}^{N}\prod_{t=1}^{T}\Big(1-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\Big)\bigg)-\ln N\\
&\geq\ln\bigg(\prod_{t=1}^{T}\Big(1-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\Big)\bigg)-\ln N\\
&=\sum_{t=1}^{T}\ln\bigg(1-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\bigg)-\ln N\\
&\geq-\alpha\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}-\alpha^{2}\sum_{t=1}^{T}\bigg(\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\bigg)^{2}-\ln N\\
&\geq-\alpha\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}-\alpha^{2}T-\ln N,
\end{aligned}\tag{7}$$

where we choose $\alpha<\frac{1}{2}$ and denote $i^*$ as the best worker in hindsight. The second equality uses the uniform initialization $w_i^{1}=1$ for all $i\in[N]$, so that $\sum_{i=1}^{N}w_i^{1}=N$. The first and the third inequalities hold due to $0<\alpha<\frac{1}{2}$ and $0\leq(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}\leq 1$ for all $i\in[N]$ and $t\in[T]$. The second inequality holds due to $\ln(1-x)\geq-x-x^{2}$ for $x\leq\frac{1}{2}$.

To derive an upper bound on $\ln\frac{\sum_{i=1}^{N}w_i^{t+1}}{\sum_{i=1}^{N}w_i^{t}}$, we have

$$\begin{aligned}
\ln\frac{\sum_{i=1}^{N}w_i^{t+1}}{\sum_{i=1}^{N}w_i^{t}}
&=\ln\bigg(\frac{\sum_{i=1}^{N}w_i^{t}\cdot\big(1-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}\big)}{\sum_{i'=1}^{N}w_{i'}^{t}}\bigg)\\
&\leq\ln\bigg(\frac{\sum_{i=1}^{N}w_i^{t}\cdot e^{-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}}{\sum_{i'=1}^{N}w_{i'}^{t}}\bigg)\\
&\leq-\alpha\frac{1}{m_t}\sum_{j=1}^{m_t}\frac{\sum_{i=1}^{N}w_i^{t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}{\sum_{i'=1}^{N}w_{i'}^{t}}+\frac{\alpha^{2}}{8},
\end{aligned}\tag{8}$$

where the first inequality holds due to $1-\alpha x\leq e^{-\alpha x}$ for $0\leq x\leq 1$ and $\alpha>0$, and the second due to Hoeffding's lemma: for a random variable $X=-\frac{1}{m_t}\sum_{j=1}^{m_t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}\in[-1,0]$ and $\alpha\in\mathbb{R}$, we have

$$\ln(\mathbb{E}[e^{\alpha X}])\leq\alpha\mathbb{E}[X]+\frac{\alpha^{2}(1-0)^{2}}{8}.$$

According to ([8](https://arxiv.org/html/2605.24052#S5.E8)), we have

$$\begin{aligned}
\ln\frac{\sum_{i=1}^{N}w_i^{T+1}}{\sum_{i=1}^{N}w_i^{1}}
&=\ln\bigg(\frac{\sum_{i=1}^{N}w_i^{T+1}}{\sum_{i=1}^{N}w_i^{T}}\cdot\frac{\sum_{i=1}^{N}w_i^{T}}{\sum_{i=1}^{N}w_i^{T-1}}\cdot\cdots\cdot\frac{\sum_{i=1}^{N}w_i^{2}}{\sum_{i=1}^{N}w_i^{1}}\bigg)\\
&=\sum_{t=1}^{T}\ln\frac{\sum_{i=1}^{N}w_i^{t+1}}{\sum_{i=1}^{N}w_i^{t}}\\
&\leq-\alpha\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\frac{\sum_{i=1}^{N}w_i^{t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}{\sum_{i'=1}^{N}w_{i'}^{t}}+\frac{\alpha^{2}T}{8}.
\end{aligned}\tag{9}$$

According to ([7](https://arxiv.org/html/2605.24052#S5.E7)) and ([9](https://arxiv.org/html/2605.24052#S5.E9)), we have

$$\begin{aligned}
&-\alpha\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}-\alpha^{2}T-\ln N\\
&\leq-\alpha\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\frac{\sum_{i=1}^{N}w_i^{t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}{\sum_{i'=1}^{N}w_{i'}^{t}}+\frac{\alpha^{2}T}{8}.
\end{aligned}$$

After rearranging the above inequality and dividing both sides by $\alpha$, we have

$$\begin{aligned}
&\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\frac{\sum_{i=1}^{N}w_i^{t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}{\sum_{i'=1}^{N}w_{i'}^{t}}\\
&-\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\leq\frac{\ln N}{\alpha}+\frac{9T\alpha}{8}.
\end{aligned}$$

Choosing $\alpha=\frac{2}{3}\sqrt{\frac{2\ln N}{T}}<\frac{1}{2}$ (true as $T\to\infty$), we have

$$\begin{aligned}
&\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\frac{\sum_{i=1}^{N}w_i^{t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}{\sum_{i'=1}^{N}w_{i'}^{t}}\\
&-\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\leq 3\sqrt{\frac{T\ln N}{2}}.
\end{aligned}$$

Finally, we have the regret $R_{\mathcal{M}}(T)$ satisfying

$$\begin{aligned}
R_{\mathcal{M}}(T)={}&\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\bigg(\sum_{i=1}^{N}\frac{w_i^{t}\hat{\mathcal{P}}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})}{\sum_{i'=1}^{N}w_{i'}^{t}}-p_j^{t}\bigg)^{2}-\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\\
\leq{}&\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\frac{\sum_{i=1}^{N}w_i^{t}(\mathcal{P}_i(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t})^{2}}{\sum_{i'=1}^{N}w_{i'}^{t}}-\sum_{t=1}^{T}\frac{1}{m_t}\sum_{j=1}^{m_t}\big(\mathcal{P}_{i^*}(y_{l_j}^{t}\succ y_{l_j'}^{t}|x_j^{t})-p_j^{t}\big)^{2}\\
\leq{}&3\sqrt{\frac{T\ln N}{2}}=\mathcal{O}(T^{\frac{1}{2}}),
\end{aligned}$$

where the first inequality holds due to the convexity of the aggregation loss function. This completes the proof. $\square$

According to Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1), our mechanism improves on benchmarks 1-3 by identifying the most accurate worker in the online learning process as $T\to\infty$. As $N$ increases, the platform may find a more accurate worker in hindsight. Thus, it chooses a larger step-size $\alpha$ in ([6](https://arxiv.org/html/2605.24052#S5.E6)) to penalize inaccurate workers more heavily in the weighted aggregation and phase them out. As $T$ increases, the platform can afford to be more patient and chooses a smaller $\alpha$ in ([6](https://arxiv.org/html/2605.24052#S5.E6)), selecting the best worker in hindsight with more time slots and samples.
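The bound of Theorem 1 can be sanity-checked with a small simulation. The sketch below is not the paper's experiment: it assumes one prompt per slot, truthful workers whose reports are the outcome plus clipped Gaussian noise with hypothetical per-worker noise levels, and unit initial weights. Because the bound holds for any loss sequence in $[0,1]$ when $\alpha<\frac{1}{2}$, the measured regret should not exceed it.

```python
import math
import random

def simulate_regret(num_workers=5, horizon=200, seed=0):
    """Run the update (6) on synthetic truthful reports; return (regret, bound)."""
    rng = random.Random(seed)
    alpha = (2.0 / 3.0) * math.sqrt(2.0 * math.log(num_workers) / horizon)
    # Hypothetical noise levels: worker 0 is the most accurate.
    noise = [0.05 + 0.1 * i for i in range(num_workers)]
    weights = [1.0] * num_workers
    platform_loss = 0.0
    worker_loss = [0.0] * num_workers
    for _ in range(horizon):
        p = rng.randint(0, 1)  # verified binary outcome for this slot
        reports = [min(1.0, max(0.0, p + rng.gauss(0.0, s))) for s in noise]
        total = sum(weights)
        agg = sum(w * r for w, r in zip(weights, reports)) / total
        platform_loss += (agg - p) ** 2
        for i, r in enumerate(reports):
            loss = (r - p) ** 2
            worker_loss[i] += loss
            weights[i] *= 1.0 - alpha * loss  # multiplicative update (6)
    regret = platform_loss - min(worker_loss)
    bound = 3.0 * math.sqrt(horizon * math.log(num_workers) / 2.0)
    return regret, bound
```

Calling `simulate_regret()` returns the realized regret together with $3\sqrt{T\ln N/2}$; varying `num_workers` and `horizon` shows how the step size and the bound scale with $N$ and $T$.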

We further strengthen Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1) with two additional properties of our mechanism. Proposition [2](https://arxiv.org/html/2605.24052#Thmproposition2) below characterizes the responsiveness of our mechanism to new high-quality workers under the uniform step-size $\alpha$. Proposition [3](https://arxiv.org/html/2605.24052#Thmproposition3) below establishes the robustness of our mechanism against noisy verification of the ground-truth system state.

###### Proposition 2 (Responsiveness to a newly-arriving high-quality worker under uniform $\alpha$)

Consider an existing worker $k\in[N]$ active prior to slot $t_{0}\in[T]$ with weight $w_{k}^{t_{0}}\in(0,\infty)$ accumulated through prior reweighting, and a newly-arriving worker $i$ entering the system at slot $t_{0}$ with initial weight $w_{i}^{t_{0}}\in(0,w_{k}^{t_{0}})$, so that the newly-arriving worker starts at a strict weight disadvantage $w_{i}^{t_{0}}<w_{k}^{t_{0}}$. Both workers continue to participate in the mechanism from slot $t_{0}$ onward under the weight update ([6](https://arxiv.org/html/2605.24052#S5.E6)). Suppose that, under truthful reporting, the newly-arriving worker is strictly more accurate, with expected per-slot squared losses

$$
\ell_{i}^{t}:=\mathbb{E}\bigg[\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\big(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})-p_{j}^{t}\big)^{2}\bigg],
$$

satisfying $\ell_{i}^{t}<\ell_{k}^{t}$ for all $t\geq t_{0}$ and accuracy gap $\Delta:=\min_{t\geq t_{0}}(\ell_{k}^{t}-\ell_{i}^{t})>0$. Under our mechanism in Definition [4](https://arxiv.org/html/2605.24052#Thmdefinition4) with step-size $\alpha=\frac{2}{3}\sqrt{2\ln N/T}$ from Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1), the newly-arriving worker overtakes the existing worker in expected weight within

$$
\tau_{\textnormal{new}}\leq\bigg\lceil\frac{\ln(w_{k}^{t_{0}}/w_{i}^{t_{0}})}{\alpha\Delta}\bigg\rceil=\mathcal{O}\bigg(\sqrt{\frac{T}{\ln N}}\cdot\Delta^{-1}\bigg)
$$

slots after arrival, regardless of how the existing worker accumulated its weight prior to slot $t_{0}$.

The proof is given in Appendix H of the supplementary material of this TMC submission. Proposition [2](https://arxiv.org/html/2605.24052#Thmproposition2) shows that our mechanism adapts quickly to a newly-arriving high-quality worker even when an existing worker enters the comparison with a substantially higher accumulated weight. The bound's numerator $\ln(w_{k}^{t_{0}}/w_{i}^{t_{0}})$ scales only logarithmically with the weight asymmetry, so even a large prior weight advantage of the existing worker is overcome in $\mathcal{O}(\sqrt{T/\ln N}\cdot\Delta^{-1})$ slots, independent of the existing worker's tenure prior to $t_{0}$. By symmetry, the same bound applies whenever any worker becomes more accurate than a previously dominant worker, regardless of how the dominant worker accumulated its weight from prior reputation or earlier accurate reporting. This responsiveness is empirically verified in Figs. [3](https://arxiv.org/html/2605.24052#S7.F3) and [5](https://arxiv.org/html/2605.24052#S7.F5), where the most accurate worker's chosen probability rises from the uniform initialization of $1/N$ to over $0.9$ (full feedback) and near $0.8$ (limited feedback) within the time horizon.
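To make the logarithmic dependence on the weight ratio concrete, the following minimal sketch evaluates the overtaking bound $\lceil\ln(w_{k}^{t_{0}}/w_{i}^{t_{0}})/(\alpha\Delta)\rceil$ from Proposition 2. The function name `overtake_bound` and the numeric inputs are illustrative assumptions, not values from the paper.

```python
import math


def overtake_bound(w_existing: float, w_new: float, alpha: float, gap: float) -> int:
    """Upper bound ceil(ln(w_k / w_i) / (alpha * Delta)) on the overtaking time.

    w_existing: weight w_k^{t0} of the incumbent worker.
    w_new:      initial weight w_i^{t0} of the newcomer (0 < w_new < w_existing).
    alpha:      step-size of the weight update.
    gap:        accuracy gap Delta > 0.
    """
    if not 0.0 < w_new < w_existing:
        raise ValueError("Proposition 2 assumes 0 < w_i < w_k")
    if alpha <= 0.0 or gap <= 0.0:
        raise ValueError("alpha and Delta must be positive")
    return math.ceil(math.log(w_existing / w_new) / (alpha * gap))


# Hypothetical inputs: a 10x and a 100x weight disadvantage.
small_ratio = overtake_bound(10.0, 1.0, alpha=0.1, gap=0.5)
large_ratio = overtake_bound(100.0, 1.0, alpha=0.1, gap=0.5)
print(small_ratio, large_ratio)
```

Increasing the weight disadvantage tenfold roughly doubles the bound here rather than multiplying it by ten, which is the logarithmic scaling discussed above.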

###### Proposition 3 (Robustness under noisy verification)

Suppose the verified ground-truth label $\tilde{p}_{j}^{t}$ differs from the true $p_{j}^{t}$ with probability at most $\epsilon\in[0,1/2)$, independently across prompts and independently of all workers' reports. Under our mechanism in Definition [4](https://arxiv.org/html/2605.24052#Thmdefinition4) with step-size $\alpha=\frac{2}{3}\sqrt{2\ln N/T}$ from Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1), the following hold:

1. (a) (Truthfulness degradation.) The best-response misreport $\hat{\mathcal{P}}_{i}^{*}$ of any worker $i$ deviates from his true preference $\mathcal{P}_{i}$ by at most $\epsilon$, i.e., $|\hat{\mathcal{P}}_{i}^{*}-\mathcal{P}_{i}|\leq\epsilon$. The cumulative strategic gain from misreporting over $T$ slots is at most $\mathcal{O}(\epsilon^{2}\sqrt{T})$, which is dominated by the $\mathcal{O}(\sqrt{T})$ regret term and vanishes as $\epsilon\to 0$.
2. (b) (Regret degradation.) The expected time-average regret satisfies $\frac{\mathbb{E}[R_{\mathcal{M}}(T)]}{T}\leq\mathcal{O}\big(\frac{1}{\sqrt{T}}\big)+2\epsilon$, recovering the clean-case bound in Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1) as $\epsilon\to 0$.

The proof is given in Appendix I of the supplementary material of this TMC submission. Proposition [3](https://arxiv.org/html/2605.24052#Thmproposition3) extends Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1) to settings where the infrastructure-side verification is imperfect, such as when decoded ACK/NACK packets in spectrum sensing are occasionally corrupted or when PeMS traffic flow measurements in navigation exhibit occasional sensor faults. The analogous result for the limited-feedback setting in Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2) can be derived similarly, with an additive $\mathcal{O}(\epsilon)$ term added to the time-average regret.

## VI Extension to Limited Worker Feedback

Recall that in Sections[III](https://arxiv.org/html/2605.24052#S3)\-[V](https://arxiv.org/html/2605.24052#S5), we assume that the platform has access to all the workers’ feedback per time slot\. In practice, collecting feedback from multiple workers can be difficult due to cost and coordination challenges unique to mobile devices \(such as battery, sensing, and uplink constraints that are absent in desktop\-based crowdsourcing platforms\), which can in turn slow down the iterative LLM fine\-tuning \(e\.g\.,\[[28](https://arxiv.org/html/2605.24052#bib.bib28)\]\)\. For example, in navigation applications, querying many workers about real\-time traffic events \(e\.g\., whether a reported accident or congestion is actually present\) increases monetary cost and delays the update cycle\. Similarly, in spectrum\-sensing applications, asking multiple mobile users to label channel conditions \(e\.g\., whether a 3\.5 GHz channel is idle or occupied at a specific time and location\) may require additional sensing operations and energy consumption, which limits the number of workers that can provide feedback\.

In this section, we extend our analysis to a challenging case where the platform can receive only one worker's report transmission per time slot. In the following, we first present our system model for this limited worker feedback scenario and the dynamic Bayesian game formulation. We then give our mechanism design and analysis.

### VI-A System Model of Limited Worker Feedback

Similar to the system model in Section [III-A](https://arxiv.org/html/2605.24052#S3.SS1), the platform iterates the online learning process over $T$ time slots, where each time slot $t\in[T]$ still contains three stages. In Stage I, instead of querying all $N$ workers' feedback, the platform can only select one worker $I_{t}\in[N]$ in each time slot $t$ for his local observations and preference feedback. We consider that the platform uses a mixed strategy to select each worker $i$ with a probability of $\frac{w_{i}^{t}}{\sum_{i^{\prime}\in[N]}w_{i^{\prime}}^{t}}$ in each time slot $t$. In Stage II, after receiving the chosen worker's feedback $\{\hat{\mathcal{P}}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}}$, the platform determines the preference for each prompt $j\in[m_{t}]$ as follows:

$$
\mathcal{P}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})=\hat{\mathcal{P}}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t}),
\tag{10}
$$

which will be included to construct the human-annotation dataset $\{\mathcal{P}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}}$ for training and updating the LLM fine-tuning policy later. In Stage III, the platform dynamically adjusts each worker's weight in the online learning process and determines

$$
w_{i}^{t+1}=f_{i}\big(\{\hat{\mathcal{P}}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}},\{p_{j}^{t}\}_{j=1}^{m_{t}}\big)
\tag{11}
$$

for the next slot $t+1$'s selection according to feedback $\{\hat{\mathcal{P}}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}}$ and the realized ground truth physical state $\{p_{j}^{t}\}_{j=1}^{m_{t}}$, where $w_{i}^{1}=1$ for any $i\in[N]$.
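Stages I and II of this model can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`selection_probabilities`, `select_worker`, and `platform_preference` are hypothetical helpers); the Stage III update $f_{i}$ is deliberately left out because it is specified by the mechanism itself.

```python
import random


def selection_probabilities(weights):
    """Mixed strategy: worker i is chosen with probability w_i / sum of all weights."""
    total = sum(weights)
    return [w / total for w in weights]


def select_worker(weights, rng):
    """Stage I: sample the index I_t of the single queried worker."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def platform_preference(reports, chosen):
    """Stage II, eq. (10): the platform adopts the chosen worker's reports as-is.

    reports[i][j] is worker i's reported preference for prompt j.
    """
    return list(reports[chosen])


rng = random.Random(0)              # fixed seed, for reproducibility only
weights = [1.0, 1.0, 1.0]           # uniform initialization w_i^1 = 1
reports = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]  # made-up reports for two prompts
chosen = select_worker(weights, rng)
print(chosen, platform_preference(reports, chosen))
```

Only the chosen worker's reports are ever read in a slot, which is what distinguishes this setting from the full-feedback aggregation.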

By strategically manipulating his reported preference $\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})$, each worker $i$ aims to maximize his long-term reputation or payment from the platform over the whole $T$ time slots as in ([4](https://arxiv.org/html/2605.24052#S3.E4)).

On the other hand, the platform adopts a mixed strategy to choose each worker $i\in[N]$ with a probability of $\frac{w_{i}^{t}}{\sum_{i^{\prime}\in[N]}w_{i^{\prime}}^{t}}$ in each time slot $t$. It wants to improve the feedback accuracy in the aggregation by assigning the largest weight to the most accurate worker. As the best worker is still unknown in the online iteration, it then aims to reduce the regret between online mixed selection and offline choice of the best worker in hindsight, where the alignment loss is defined as the MSE between the platform's mixed selection and the realized binary ground truth physical state as follows:

$$
\begin{aligned}
R(T):={}&\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})-p_{j}^{t}\bigg)^{2} \\
&-\min_{i\in[N]}\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\big(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})-p_{j}^{t}\big)^{2}, \\
:={}&\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}\hat{\ell}_{i}^{t}-\min_{i\in[N]}\sum_{t=1}^{T}\ell_{i}^{t},
\end{aligned}
\tag{12}
$$

where we define $\hat{\ell}_{i}^{t}:=\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})-p_{j}^{t}\big)^{2}$ and $\ell_{i}^{t}:=\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\big(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})-p_{j}^{t}\big)^{2}$.
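The regret in (12) can be evaluated directly once the per-slot weights and losses are logged. The sketch below is one way to do so; the function names and the nested-list data layout (`weights[t][i]`, `reported_losses[t][i]`, `true_losses[t][i]`) are our assumptions for illustration.

```python
def per_slot_loss(preferences, ground_truth):
    """Mean squared error between one worker's preferences and the realized truth."""
    return sum((q - p) ** 2 for q, p in zip(preferences, ground_truth)) / len(ground_truth)


def mixed_selection_regret(weights, reported_losses, true_losses):
    """Regret in eq. (12).

    weights[t][i]:         weight w_i^t at slot t.
    reported_losses[t][i]: loss of worker i's reported preferences at slot t.
    true_losses[t][i]:     loss of worker i's true preferences at slot t.
    """
    num_workers = len(weights[0])
    online = 0.0
    for w_t, loss_t in zip(weights, reported_losses):
        total = sum(w_t)
        # Expected loss of the mixed selection at this slot.
        online += sum(w / total * loss for w, loss in zip(w_t, loss_t))
    # Cumulative loss of the single best worker in hindsight.
    best = min(sum(row[i] for row in true_losses) for i in range(num_workers))
    return online - best


# Toy log with two workers and two slots (made-up numbers, truthful reports).
losses = [[0.0, 1.0], [0.0, 1.0]]
print(mixed_selection_regret([[1.0, 1.0], [1.0, 1.0]], losses, losses))
```

With uniform weights the platform pays the average loss in every slot, whereas concentrating all weight on the accurate worker drives the regret to zero in this toy log.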

### VI-B Dynamic Bayesian Game Formulation under Limited Worker Feedback

Based on our system model above, we formulate the multi\-agent online learning as a new dynamic Bayesian game:

- In Stage I of each time slot $t\in[T]$, the platform first chooses a worker $I_{t}\in[N]$ for his feedback. Then, the chosen worker $I_{t}$ with his private preference $\{\mathcal{P}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}}$ determines his feedback $\{\hat{\mathcal{P}}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}}$ to maximize his utility in ([4](https://arxiv.org/html/2605.24052#S3.E4)).
- In Stage III of each time slot $t\in[T]$, the platform updates each worker's weight $w_{i}^{t+1}=f_{i}\big(\{\hat{\mathcal{P}}_{I_{t}}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})\}_{j=1}^{m_{t}},\{p_{j}^{t}\}_{j=1}^{m_{t}}\big)$ for reducing its regret in ([12](https://arxiv.org/html/2605.24052#S6.E12)).

Note that there is no strategic decision for any worker or the platform in Stage II. We need to carefully design an online mixed selection mechanism that ensures each worker's truthful feedback and a vanishing regret over time, as given in Definitions [1](https://arxiv.org/html/2605.24052#Thmdefinition1) and [2](https://arxiv.org/html/2605.24052#Thmdefinition2). Before that, we first examine the EXP3 scheme from the crowdsourcing literature as a benchmark.

### VI-C Benchmark 4: EXP3 for Limited Worker Feedback

The EXP3 scheme, a variant of Hedge in the crowdsourcing literature (e.g., [[35](https://arxiv.org/html/2605.24052#bib.bib35)], [[36](https://arxiv.org/html/2605.24052#bib.bib36)]) for limited worker feedback, updates worker $i$'s weight as follows:

$$
w_{i}^{t+1}=\begin{cases}w_{i}^{t}\cdot e^{-\eta\cdot\tilde{\ell}_{i}^{t}},&\text{if}\ i=I_{t},\\ w_{i}^{t},&\text{otherwise,}\end{cases}
$$

where an unbiased estimator $\tilde{\ell}_{i}^{t}$ of worker $i$'s feedback loss is given as follows:

$$
\tilde{\ell}_{i}^{t}=\begin{cases}\frac{\ell_{i}^{t}}{(1-\beta)w_{i}^{t}/\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}+\beta/N},&\text{if}\ i=I_{t},\\ 0,&\text{otherwise.}\end{cases}
\tag{13}
$$

Unfortunately, such an EXP3 scheme is still not truthful as shown below.

###### Lemma 4

Benchmark 4 (the EXP3 scheme) is not truthful.

The proof is given in Appendix E of the supplementary material of this TMC submission. Lemma [4](https://arxiv.org/html/2605.24052#Thmlemma4) indicates that it is non-trivial to incentivize workers' truthful reports under the limited worker feedback case. Therefore, a new weight update scheme needs to be investigated for limited worker feedback.
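For reference, one step of the Benchmark 4 update with the importance-weighted loss estimate in (13) can be sketched as follows. The function name `exp3_update` is ours, and $\eta$ and $\beta$ are treated as given inputs rather than tuned values.

```python
import math


def exp3_update(weights, chosen, loss, eta, beta):
    """One EXP3 step: only the chosen worker's weight changes.

    weights: current weights w_i^t for all N workers.
    chosen:  index I_t of the worker queried in this slot.
    loss:    observed per-slot loss of the chosen worker.
    eta:     learning rate of the exponential update.
    beta:    mixing parameter in the denominator of eq. (13).
    """
    n = len(weights)
    total = sum(weights)
    # Denominator of eq. (13): (1 - beta) * w_i / sum(w) + beta / N.
    mixed_prob = (1.0 - beta) * weights[chosen] / total + beta / n
    loss_estimate = loss / mixed_prob
    new_weights = list(weights)
    new_weights[chosen] = weights[chosen] * math.exp(-eta * loss_estimate)
    return new_weights


# Hypothetical step with two workers; worker 0 is queried and incurs loss 0.5.
print(exp3_update([1.0, 1.0], chosen=0, loss=0.5, eta=0.1, beta=0.2))
```

Dividing by the selection probability inflates the loss of rarely chosen workers, which compensates for how seldom their feedback is observed.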

### VI-D Our Truthful Online Mixed Selection Mechanism Design and Analysis

As the platform can only observe one worker’s feedback in each time slot, it needs to carefully balance the exploitation of selected workers and the exploration of unselected ones during the online learning process\. Further, with such limited feedback information, it is even more challenging to guarantee a sublinear regret\. The idea of our new mechanism design is to introduce an exploitation parameter and update the selected worker’s weight with some probability\. We define our mechanism in the following\.

###### Definition 5 (Online Mixed Selection Mechanism)

At Stage III of each time slot $t\in[T]$, the platform updates each worker's weight $w_{i}^{t+1}$ in ([11](https://arxiv.org/html/2605.24052#S6.E11)) as follows:

$$
w_{i}^{t+1}=\begin{cases}(1-\beta)\gamma_{i}^{t+1}+\beta,&\text{if}\ i=I_{t},\\ w_{i}^{t},&\text{otherwise,}\end{cases}
\tag{14}
$$

where $\beta\in(0,1)$ is an exploitation parameter to be determined later,

$$
\gamma_{i}^{t+1}=\begin{cases}\gamma_{i}^{t}\bigg(1-\alpha\frac{\ell_{i}^{t}(1-\alpha/\theta_{i}^{t})}{\theta_{i}^{t}}\bigg),&\text{if}\ i=I_{t},\\ \gamma_{i}^{t},&\text{otherwise},\end{cases}
\tag{15}
$$

$\theta_{i}^{t}=\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}$ denotes the probability of selecting worker $i$, $\gamma_{i}^{1}=1$ for $i\in[N]$, and $\alpha\in(0,\theta_{i}^{t})$ is a step-size parameter to be determined later.

Our mechanism in Definition [5](https://arxiv.org/html/2605.24052#Thmdefinition5) balances exploration and exploitation during the online learning process. It updates each chosen worker's weight $w_{i}^{t+1}$ in ([14](https://arxiv.org/html/2605.24052#S6.E14)) only with a probability of $1-\beta$ for uniform exploration. Note that if a worker has been frequently chosen before, his weight keeps decreasing from 1 and becomes smaller than the others'. Therefore, the probability that he will be chosen in future time slots is lower than that of the unchosen workers, especially when his feedback accuracy is low. Further, a chosen worker $i$'s weight $w_{i}^{t+1}$ in ([14](https://arxiv.org/html/2605.24052#S6.E14)) increases linearly with $\gamma_{i}^{t+1}$ in ([15](https://arxiv.org/html/2605.24052#S6.E15)), which decreases by a small amount if his feedback loss $\hat{\ell}_{i}^{t}$ is small and by a large amount otherwise. By carefully designing $\alpha$ and $\beta$, our mechanism can guarantee valid $w_{i}^{t}$ in ([14](https://arxiv.org/html/2605.24052#S6.E14)) and $\gamma_{i}^{t}$ in ([15](https://arxiv.org/html/2605.24052#S6.E15)).
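A single Stage III step of Definition 5 can be sketched as below, transcribing (14) and (15) literally. The function name `mixed_selection_update`, the list-based state, and the sample numbers are our assumptions; in particular, the check $\alpha<\theta_{i}^{t}$ simply enforces the range stated in the definition.

```python
def mixed_selection_update(weights, gammas, chosen, loss, alpha, beta):
    """One weight update following eqs. (14)-(15); only the chosen worker changes.

    weights: current weights w_i^t.
    gammas:  current auxiliary variables gamma_i^t (initialized to 1).
    chosen:  index I_t of the queried worker.
    loss:    per-slot loss of the chosen worker's feedback.
    alpha:   step-size, required to lie in (0, theta_i^t).
    beta:    exploitation parameter in (0, 1).
    """
    theta = weights[chosen] / sum(weights)  # selection probability theta_i^t
    if not 0.0 < alpha < theta:
        raise ValueError("Definition 5 requires 0 < alpha < theta_i^t")
    new_gammas = list(gammas)
    # Eq. (15): multiplicative decrease that grows with the observed loss.
    new_gammas[chosen] = gammas[chosen] * (
        1.0 - alpha * loss * (1.0 - alpha / theta) / theta
    )
    new_weights = list(weights)
    # Eq. (14): mix the updated gamma with the constant beta.
    new_weights[chosen] = (1.0 - beta) * new_gammas[chosen] + beta
    return new_weights, new_gammas


# Hypothetical step: worker 0 is chosen and incurs the maximum loss of 1.
w, g = mixed_selection_update([1.0, 1.0], [1.0, 1.0], 0, loss=1.0, alpha=0.1, beta=0.2)
print(w, g)
```

A zero loss leaves the chosen worker's weight at its initial value of 1, while larger losses pull it down toward $\beta$, never below it as long as $\gamma_{i}^{t+1}$ stays non-negative.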

Since each worker holds a Bernoulli belief on the ground truth $p_{j}^{t}$, our mechanism satisfies the truthful property as shown below.

###### Proposition 4

Our mechanism in Definition [5](https://arxiv.org/html/2605.24052#Thmdefinition5) is truthful, i.e., $\hat{\mathcal{P}}_{i}^{*}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})=\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})$ for any prompt $j\in[m_{t}]$, worker $i\in[N]$, and time slot $t\in[T]$.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x1.png)
Figure 2: Each worker $i$'s weight $w_{i}^{t}$ of our truthful online weighted aggregation versus time slot $t$ under full feedback. Here we fix worker number $N=5$, total time slot number $T=500$, and prompt number $m_{t}=20$.

The proof is given in Appendix F of the supplementary material of this TMC submission. Further, our mechanism is efficient and incurs a vanishing time-average regret in $T$.

###### Theorem 2

Our mechanism in Definition [5](https://arxiv.org/html/2605.24052#Thmdefinition5) incurs the sublinear accuracy regret $R_{\mathcal{M}}(T)=\mathcal{O}(\sqrt{T})$ by choosing step-sizes $\beta$ in ([14](https://arxiv.org/html/2605.24052#S6.E14)) and $\alpha$ in ([15](https://arxiv.org/html/2605.24052#S6.E15)) as follows:

$$
\beta=2\sqrt{\frac{N\ln N}{7T}},\quad \alpha=\sqrt{\frac{\ln N}{7NT}},
$$

leading to zero time-average regret with $\lim_{T\to\infty}\frac{R_{\mathcal{M}}(T)}{T}=0$.

The proof is given in Appendix G of the supplementary material of this TMC submission. According to Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2), our mechanism can distinguish the most accurate worker in the online process as $T\to\infty$. As the worker number $N$ increases, the platform faces more uncertainty about feedback accuracy in exploration than in exploitation. Thus, it becomes more patient, with a larger exploitation parameter $\beta$ and a smaller step-size $\alpha$, to punish inaccurate (chosen) workers less in the weight update. On the other hand, as the time slot number $T$ increases, the platform has more room to explore for the most accurate worker with more time slots and samples. Thus, it chooses a smaller exploitation parameter $\beta$ and a smaller step-size $\alpha$ to punish inaccurate (chosen) workers less in the weight update.
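The parameter choices in Theorem 2 are easy to tabulate. The sketch below evaluates $\beta=2\sqrt{N\ln N/(7T)}$ and $\alpha=\sqrt{\ln N/(7NT)}$; the function name `limited_feedback_parameters` and the grid of $(N,T)$ values are illustrative assumptions.

```python
import math


def limited_feedback_parameters(num_workers: int, horizon: int):
    """Return (beta, alpha) as chosen in Theorem 2.

    beta  = 2 * sqrt(N ln N / (7 T))
    alpha = sqrt(ln N / (7 N T))
    """
    if num_workers < 2 or horizon < 1:
        raise ValueError("need N >= 2 and T >= 1")
    log_n = math.log(num_workers)
    beta = 2.0 * math.sqrt(num_workers * log_n / (7.0 * horizon))
    alpha = math.sqrt(log_n / (7.0 * num_workers * horizon))
    return beta, alpha


# Illustrative grid: beta rises and alpha falls with N; both fall with T.
for n, t in ((5, 2500), (50, 2500), (5, 25000)):
    beta, alpha = limited_feedback_parameters(n, t)
    print(f"N={n}, T={t}: beta={beta:.4f}, alpha={alpha:.5f}")
```

Algebraically the two choices satisfy $\beta=2N\alpha$. Note also that for short horizons the formula for $\beta$ can exceed 1 (for instance with $N=50$ and $T=100$), so the requirement $\beta\in(0,1)$ in Definition 5 implicitly needs $T$ to be large enough relative to $N\ln N$.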

## VII Experiments

In this section, we run experiments to show our mechanism's improvement over the benchmark schemes. In Section [VII-A](https://arxiv.org/html/2605.24052#S7.SS1), we use synthetic data to evaluate the performance of our proposed mechanisms against benchmarks. In Section [VII-B](https://arxiv.org/html/2605.24052#S7.SS2), we further fine-tune LLMs on real-world datasets to validate our mechanisms' advantages.

### VII-A Experiments on Synthetic Data

For ease of exposition and illustration, we first consider $N=5$ workers and fix prompt number $m_{t}=20$. We randomly generate the binary realized ground truth $p_{j}^{t}\in\{1,0\}$ for each prompt $j\in[m_{t}]$. We add random noise from the set of $\{[0,0.1],[0.45,0.55],[0.55,0.65],[0.65,0.75],[0.75,0.85]\}$ to workers' preferences in order. For example, worker 1 is the most accurate with his preference noise in the range of $[0,0.1]$ on the realized ground truth, and worker 5 is the least accurate with his preference noise in the range of $[0.75,0.85]$ on the realized binary ground truth.
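One plausible way to generate such data is sketched below. The text does not specify how the noise enters a worker's preference, so two points are our assumptions: the noise is drawn uniformly from each worker's range, and it shifts the preference away from the realized ground truth, i.e., the preference is $|p_{j}^{t}-\text{noise}|$. The names `NOISE_RANGES`, `generate_slot`, and `squared_loss` are likewise hypothetical.

```python
import random

# Noise ranges for workers 1..5, in the order given in the text.
NOISE_RANGES = [(0.0, 0.1), (0.45, 0.55), (0.55, 0.65), (0.65, 0.75), (0.75, 0.85)]


def generate_slot(num_prompts, rng):
    """Draw binary ground truth and each worker's noisy preferences for one slot.

    Assumption: noise ~ Uniform(low, high) and preference = |truth - noise|,
    so a larger noise pushes the preference further from the ground truth.
    """
    truth = [rng.randint(0, 1) for _ in range(num_prompts)]
    preferences = [
        [abs(p - rng.uniform(low, high)) for p in truth]
        for low, high in NOISE_RANGES
    ]
    return truth, preferences


def squared_loss(preference, truth):
    """Per-slot mean squared error of one worker against the ground truth."""
    return sum((q - p) ** 2 for q, p in zip(preference, truth)) / len(truth)


rng = random.Random(0)  # fixed seed, for reproducibility only
truth, preferences = generate_slot(20, rng)
losses = [squared_loss(q, truth) for q in preferences]
print([round(loss, 3) for loss in losses])
```

Under this construction each worker's per-prompt squared error equals the square of his noise draw, so the losses are ordered from worker 1 (smallest) to worker 5 (largest), matching the accuracy ranking described above.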

Figure [2](https://arxiv.org/html/2605.24052#S6.F2) shows how each worker $i$'s weight $w_{i}^{t}$ evolves over time slot $t\in[T]$ under full worker feedback, where each worker's feedback is available to the platform in each time slot $t$. As $t$ increases to 500, our mechanism manages to assign the largest weight to the most accurate worker 1 and assign much smaller weights to the remaining ones, especially the most inaccurate worker 5, which is consistent with each worker's weight ([6](https://arxiv.org/html/2605.24052#S5.E6)) in Definition [4](https://arxiv.org/html/2605.24052#Thmdefinition4).

![Refer to caption](https://arxiv.org/html/2605.24052v1/x2.png)
Figure 3: Each worker $i$'s chosen probability $\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}$ of our truthful online weighted aggregation versus time slot $t$ under full feedback. Here we fix worker number $N=5$, total time slot number $T=500$, and prompt number $m_{t}=20$.

Figure [3](https://arxiv.org/html/2605.24052#S7.F3) is similar to Fig. [2](https://arxiv.org/html/2605.24052#S6.F2) and shows how each worker $i$'s chosen probability $\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}$ evolves over time slot $t$. As $t$ increases to 500, our mechanism manages to assign the largest probability to the most accurate worker 1 (over 0.9) and assign near-zero probabilities to the remaining ones, especially the most inaccurate worker 5, which further verifies the effectiveness of our mechanism in Definition [4](https://arxiv.org/html/2605.24052#Thmdefinition4).

![Refer to caption](https://arxiv.org/html/2605.24052v1/x3.png)
Figure 4: Each worker $i$'s weight $w_{i}^{t}$ of our truthful online mixed selection versus time slot $t$ under limited worker feedback. Here we fix worker number $N=5$, total time slot number $T=2500$, and prompt number $m_{t}=20$.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x4.png)
Figure 5: Each worker $i$'s chosen probability $\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}$ of our truthful online mixed selection versus time slot $t$ under limited worker feedback. Here we fix worker number $N=5$, total time slot number $T=2500$, and prompt number $m_{t}=20$.

Figure [4](https://arxiv.org/html/2605.24052#S7.F4) shows how each worker $i$'s weight $w_{i}^{t}$ evolves over time slot $t\in[T]$ under limited worker feedback, where only one selected worker's sensing feedback is available to the platform in each time slot $t$. It indicates that our mechanism still manages to assign the largest weight to the most accurate worker 1 over time, which is consistent with ([14](https://arxiv.org/html/2605.24052#S6.E14)) in Definition [5](https://arxiv.org/html/2605.24052#Thmdefinition5). Yet, as the platform only has access to one selected worker's feedback in each time slot, our mechanism needs more time slots to assign relatively small weights to the remaining ones compared to Fig. [2](https://arxiv.org/html/2605.24052#S6.F2) under full feedback.

Figure [5](https://arxiv.org/html/2605.24052#S7.F5) is similar to Fig. [4](https://arxiv.org/html/2605.24052#S7.F4) and shows how each worker $i$'s chosen probability $\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}$ evolves over time slot $t\in[T]$ under limited feedback. It indicates that our mechanism manages to assign the largest probability to the most accurate worker 1 (near 0.8), which further verifies the effectiveness of our mechanism in Definition [5](https://arxiv.org/html/2605.24052#Thmdefinition5). Yet, under limited feedback, our mechanism needs more time slots (as $t$ increases to 2500) before the remaining workers' probabilities become relatively small.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x5.png)
Figure 6: Time-average regrets of our mechanism under full worker feedback versus the time slot number $T$ for different worker numbers $N\in\{5,10,15,20,25\}$. Here we fix prompt number $m_{t}=20$ and choose the learning rate $\alpha=\frac{2}{3}\sqrt{\frac{2\ln N}{T}}$ following Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1).

Figure [6](https://arxiv.org/html/2605.24052#S7.F6) shows the time-average regret $R(T)/T$ of our mechanism under full worker feedback versus the time slot number $T$ for different worker numbers $N\in\{5,10,15,20,25\}$. We find that as the worker number $N$ increases, our mechanism needs more time slots to converge because the platform must explore and down-weight more inaccurate workers before identifying the most accurate one. Nevertheless, across all tested worker scales, our mechanism's time-average regret consistently decreases with $T$ and approaches 0, verifying the sublinear $\mathcal{O}(\sqrt{T})$ regret guarantee in Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1). This further demonstrates that our mechanism scales gracefully with the worker number under full feedback.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x6.png)
Figure 7: Time-average regrets of benchmarks 1, 2, 3 and our mechanism under full worker feedback versus the time slot number $T$, respectively. Here we choose the same learning rate $\alpha=\eta=\frac{2}{3}\sqrt{\frac{2\ln N}{T}}$ for both benchmark 2 and our method, consider a large worker scale of $N=50$, and fix prompt number $m_{t}=20$.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x7.png)
Figure 8: Time-average regrets of our mechanism under limited worker feedback versus the time slot number $T$ for different worker numbers $N\in\{5,10,15,20,25\}$. Here we fix prompt number $m_{t}=20$ and choose $\alpha=\sqrt{\frac{\ln N}{7NT}}$ and $\beta=2\sqrt{\frac{N\ln N}{7T}}$ following Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2).

Figure [7](https://arxiv.org/html/2605.24052#S7.F7) shows the time-average regrets $R(T)/T$ of benchmarks 1, 2, 3, and our mechanism under full worker feedback versus the time slot number $T$ under a large worker scale of $N=50$. We find that our mechanism greatly reduces the platform's time-average regret relative to the three benchmarks. Besides, the time-average regrets of benchmarks 1-3 do not decrease with $T$ and remain strictly positive. In contrast, our mechanism's time-average regret decreases with $T$ and tends to 0, consistent with Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1).

Figure [8](https://arxiv.org/html/2605.24052#S7.F8) shows the time-average regret $R(T)/T$ of our mechanism under limited worker feedback versus the time slot number $T$ for different worker numbers $N\in\{5,10,15,20,25\}$. Compared with the full feedback case in Fig. [6](https://arxiv.org/html/2605.24052#S7.F6), the convergence requires substantially more time slots because only one worker's feedback is available per time slot, reducing the per-slot per-worker observation rate to $1/N$. The effect of $N$ on convergence is therefore more pronounced under limited feedback. Nevertheless, our mechanism's time-average regret still decreases with $T$ and approaches 0 across all tested worker scales, verifying the sublinear $\mathcal{O}(\sqrt{T})$ regret guarantee in Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2).

![Refer to caption](https://arxiv.org/html/2605.24052v1/x8.png)
Figure 9: Time-average regrets of benchmarks 1, 3, 4 and our mechanism under limited worker feedback versus the time slot number $T$, respectively. Here we choose $\alpha=\eta=\sqrt{\frac{\ln N}{7NT}}$ and $\beta=2\sqrt{\frac{N\ln N}{7T}}$ for both benchmark 4 and our online mixed selection scheme, consider a large worker scale of $N=50$, and fix the prompt number $m_{t}=20$. For a fair comparison, we randomly select 10 workers' feedback out of the total 50 for the median scheme to make a decision.

Figure [9](https://arxiv.org/html/2605.24052#S7.F9) shows the time-average regrets $R(T)/T$ of benchmarks 1, 3, 4, and our mechanism under limited feedback versus the time slot number $T$ under a large worker scale of $N=50$. We find that our mechanism still greatly reduces the platform's time-average regret relative to the three benchmarks. Besides, the time-average regrets of benchmarks 1, 3, and 4 do not decrease with $T$ and remain strictly positive. Note that compared with the full feedback case in Fig. [7](https://arxiv.org/html/2605.24052#S7.F7), our mechanism needs more time slots to obtain a small enough time-average regret under limited feedback. Nevertheless, our mechanism's time-average regret still decreases with $T$ and tends to 0, consistent with Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2).

### VII-B Experiments on LLM Fine-Tuning on Real-World Datasets

In this subsection, we further evaluate our proposed mechanisms’ performance against the benchmark schemes by fine\-tuning GPT\-2 for a downstream cooperative spectrum sensing \(CSS\) task to extend from the conference version of this paper\[[1](https://arxiv.org/html/2605.24052#bib.bib1)\]\. We follow\[[59](https://arxiv.org/html/2605.24052#bib.bib59)\]to use the real\-world WiFi SDR dataset, which is collected using 5 USRP N210s SDRs running GNU Radio\. The dataset contains about 500,000 transmissions over four 5 MHz non\-overlapping channels and records each channel’s primary user \(PU\) presence over time, occupying a total of 20 MHz bandwidth\. It also contains heterogeneous secondary users’ \(SUs\) probabilistic beliefs about PU presence on each subband from their local spectrum measurements as included in the original dataset, with different noise levels and biases modeling varying sensing quality\. The complex baseband signal is segmented into non\-overlapping windows of 32 samples and transformed with a 32\-point FFT; the resulting spectrum is partitioned into four equal subbands\.

![Refer to caption](https://arxiv.org/html/2605.24052v1/new.png)

Figure 10: A prompt example for LLM fine-tuning training and testing on the downstream CSS task.

We set the total number of time rounds to $T=20$, where each round contains 20,000 data points for LLM fine-tuning training and 5,000 data points for testing, in time-sequential order. As shown in Fig. [10](https://arxiv.org/html/2605.24052#S7.F10), we follow Section II-A to construct prompts for LLM fine-tuning training and testing.

We run benchmarks 1–3 and 5–6 together with our truthful online aggregation mechanism in Definition 4 to aggregate SU reports. Benchmark 5 (NeurIPS 2025 [[60](https://arxiv.org/html/2605.24052#bib.bib60)]) is a recent state-of-the-art online weighted-majority-voting scheme also designed for LLM fine-tuning with preference data aggregated from multiple experts, and benchmark 6 is a Bayesian Beta-Binomial aggregation baseline over a latent worker-accuracy parameter. Each scheme yields per-subband scores, interpreted as the probability of PU presence, from which we construct human-annotation sensing datasets. The sensing datasets are then used to update a CSS policy using DPO to solve a KL-regularized optimization problem against the reference policy as in ([2](https://arxiv.org/html/2605.24052#S3.E2)), where we set $\beta=0.01$ as the parameter weighting the deviation from the reference policy $\pi_{\texttt{ref}}$. GPT-2 is then fine-tuned separately on each sensing dataset with LoRA [[47](https://arxiv.org/html/2605.24052#bib.bib47)] adapters (rank $r=8$, scaling factor $\alpha=16$, dropout $0.05$) inserted into the attention projection matrices, while the base GPT-2 weights remain frozen. We use a maximum sequence length of 512 tokens, batch size 8, and up to 3 epochs, with AdamW at learning rate $5\times 10^{-5}$. For GPT-2 (124M), the trainable LoRA parameters total approximately 0.3M, i.e., roughly 0.24% of the full model. At evaluation time, both the base and fine-tuned models classify by comparing the sequence log-likelihoods of the two fixed completions BUSY and IDLE.
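As a rough sanity check of the reported trainable-parameter count, the following sketch counts LoRA parameters for GPT-2 (124M). It assumes, as our own reading of "attention projection matrices", that one rank-8 adapter is attached per layer to the fused query/key/value projection (768 inputs, 2304 outputs); the paper does not state the exact target modules, and the helper name `lora_params` is hypothetical. The architecture constants are the publicly documented values for the smallest GPT-2 model.

```python
# GPT-2 (124M) architecture constants (publicly documented for the small model).
NUM_LAYERS = 12
HIDDEN = 768
TOTAL_PARAMS = 124_439_808

RANK = 8  # LoRA rank r from the text


def lora_params(in_features, out_features, rank):
    """A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


# Assumption: one adapter per layer on the fused QKV projection (768 -> 3 * 768).
per_layer = lora_params(HIDDEN, 3 * HIDDEN, RANK)
trainable = NUM_LAYERS * per_layer
fraction = trainable / TOTAL_PARAMS

print(trainable)                    # 294912
print(f"{100 * fraction:.2f}%")     # 0.24%
```

Under this assumption the count is 294,912 parameters, which matches the approximately 0.3M and roughly 0.24% quoted above. The scaling factor $\alpha=16$ and the dropout rate do not affect the parameter count.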

To evaluate the fine-tuning performance under different approaches, we adopt the win-rate metric from the LLM fine-tuning literature (e.g., [[23](https://arxiv.org/html/2605.24052#bib.bib23)], [[45](https://arxiv.org/html/2605.24052#bib.bib45)], [[11](https://arxiv.org/html/2605.24052#bib.bib11)]), defined here as the fraction of test prompts on which the fine-tuned policy predicts the correct label while the reference model GPT-2 fails:

$$\text{Win-rate}:=\frac{1}{N_{test}}\sum_{n=1}^{N_{test}}\mathbf{1}\!\left(\hat{y}_{n}^{\mathrm{pol}}=y_{n}\ \wedge\ \hat{y}_{n}^{\mathrm{ref}}\neq y_{n}\right),$$

where $N_{test}=5000$ denotes the number of test prompts in each round, $y_{n}$ denotes the ground-truth PU presence, $\hat{y}_{n}^{\mathrm{pol}}$ denotes the LLM-generated PU presence under a policy, and $\hat{y}_{n}^{\mathrm{ref}}$ denotes the PU presence generated by the base GPT-2.
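The win-rate definition above translates directly into a few lines of code. The sketch below is illustrative only; the function name `win_rate` and the list-based label encoding are assumptions, not part of the paper’s implementation.

```python
def win_rate(y_true, y_policy, y_reference):
    """Fraction of test prompts where the fine-tuned policy is correct
    while the reference model is wrong (the win-rate defined above)."""
    if not (len(y_true) == len(y_policy) == len(y_reference)):
        raise ValueError("all label sequences must have the same length")
    if not y_true:
        raise ValueError("need at least one test prompt")
    wins = sum(
        1
        for y, pol, ref in zip(y_true, y_policy, y_reference)
        if pol == y and ref != y
    )
    return wins / len(y_true)


# Toy example with 4 prompts (1 = PU present, 0 = PU absent).
truth = [1, 0, 1, 1]
policy = [1, 0, 0, 1]
reference = [0, 0, 1, 0]
print(win_rate(truth, policy, reference))  # 0.5
```

Note that prompts on which both models are correct do not count as wins, so a stronger reference model leaves less room for a high win-rate.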

![Refer to caption](https://arxiv.org/html/2605.24052v1/x9.png)

Figure 11: Win-rates after fine-tuning GPT-2 (124M) with benchmarks 1–3, 5, 6 and our mechanism in Definition 4 versus time rounds under full SU feedback.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x10.png)

Figure 12: Win-rates after fine-tuning GPT-2 with benchmarks 1–3, 5, 6 and our mechanism in Definition 4 at time round $T=20$ versus GPT-2 model size under full SU feedback.

Figure [11](https://arxiv.org/html/2605.24052#S7.F11) shows the win-rates versus time rounds after fine-tuning GPT-2 (124M) with benchmarks 1–3, 5, 6 and our mechanism in Definition 4 under full SU feedback. Our mechanism achieves the highest win-rate among all the approaches. Because they are not truthful, benchmarks 1–2 fail to obtain a meaningful human-annotation sensing dataset for fine-tuning, and their performance even degrades relative to the base GPT-2 model over time. The state-of-the-art scheme (benchmark 5) and the Bayesian Beta-Binomial benchmark (benchmark 6) also underperform our mechanism, with win-rates clustering near the untruthful Hedge and EM benchmarks because neither scheme filters out strategic misreports from selfish workers. Compared with the median scheme (benchmark 3), our mechanism successfully incentivizes truthful reports from all SUs and gradually assigns the largest weights to the most accurate SUs as time evolves, leading to a substantially larger win-rate.

Figure [12](https://arxiv.org/html/2605.24052#S7.F12) shows the win-rates after fine-tuning GPT-2 with benchmarks 1–3, 5, 6 and our mechanism in Definition 4 at time round $T=20$ versus GPT-2 model size under full SU feedback. As the GPT-2 model size increases, the base model without fine-tuning performs better on PU presence prediction, leading to a slight decrease in win-rate for all the approaches. Nonetheless, our mechanism still achieves the highest win-rate across all GPT-2 model sizes, substantially outperforming both the classical and recent benchmark schemes, consistent with our truthfulness and regret guarantees in Theorems 1 and 2.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x11.png)

Figure 13: Win-rates after fine-tuning GPT-2 (124M) with benchmarks 1, 3, 4, 5, 6 and our mechanism in Definition 5 versus time rounds under limited SU feedback. For a fair comparison, the median scheme (benchmark 3) is limited to the feedback of 3 SUs.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x12.png)

Figure 14: Win-rates after fine-tuning GPT-2 with benchmarks 1, 3, 4, 5, 6 and our mechanism in Definition 5 at time round $T=20$ versus GPT-2 model size under limited SU feedback.

Figure [13](https://arxiv.org/html/2605.24052#S7.F13) shows the win-rates after fine-tuning GPT-2 (124M) with benchmarks 1, 3, 4, 5, 6 and our mechanism in Definition 5 versus time rounds under limited SU feedback. With limited SU feedback, the sensing accuracy of all the approaches degrades, so we expect all the final win-rates to decrease. Nevertheless, our mechanism still achieves the highest win-rate among all the approaches, and benchmarks 5 and 6 again underperform for the same reason: they are not truthful under strategic workers. Similar insights can be drawn from Figure [14](https://arxiv.org/html/2605.24052#S7.F14).

Figure [15](https://arxiv.org/html/2605.24052#S7.F15) shows the win-rates of our mechanism in Definition 4 across three worker scales $N\in\{5,15,25\}$ and four GPT-2 model sizes, with the median benchmark included as a reference. As $N$ increases, our mechanism needs more time slots to identify and up-weight the most accurate workers, leading to a mild degradation in win-rate at fixed $T=20$ under all model sizes. Nevertheless, across all tested worker scales and model sizes, our mechanism consistently outperforms the median benchmark (best among the baselines), demonstrating that our mechanism scales gracefully with the number of workers under full feedback.

Figure [16](https://arxiv.org/html/2605.24052#S8.F16) shows the counterpart results under limited SU feedback. Compared with the full-feedback case in Fig. [15](https://arxiv.org/html/2605.24052#S7.F15), the degradation with $N$ is more pronounced because only one worker’s feedback is available per time slot, reducing the per-slot per-worker observation rate to $1/N$ and slowing the convergence. Nevertheless, our mechanism still beats the median benchmark (best among the baselines) across all tested worker scales and model sizes, confirming that the mechanism remains effective under the more challenging limited-feedback setting.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x13.png)

Figure 15: Win-rates after fine-tuning GPT-2 with our mechanism in Definition 4 and benchmark 3 (median) at time round $T=20$ versus GPT-2 model size under full SU feedback, for worker numbers $N\in\{5,15,25\}$.

## VIII Conclusion

In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing against selfish workers. We prove that existing adaptive aggregation methods (e.g., EM-based weight estimation, Hedge, median, and EXP3) are not truthful and fail to identify the most accurate worker in the online setting. To address this challenge, we develop a truthful online weighted aggregation mechanism that dynamically adjusts each worker’s weight based on historical feedback accuracy. Our mechanism incentivizes truthful reporting and achieves a sublinear regret of $\mathcal{O}(\sqrt{T})$. We further extend our design to the challenging scenario where only one worker’s feedback is available per time slot, and we prove that truthful reporting and $\mathcal{O}(\sqrt{T})$ regret can still be guaranteed. Experiments on LLM fine-tuning with real-world datasets demonstrate substantial performance improvements over existing benchmark schemes.

A promising direction for future work is to consider heterogeneous coverage constraints in mobile applications\. For example, a worker may only observe a subset of prompts or environmental conditions \(e\.g\., specific routes in navigation or specific frequency bands in spectrum sensing\)\. In such cases, the platform must assign tasks to workers under partial observability, making it non\-trivial to extend our analysis to both full and limited\-feedback settings\.

![Refer to caption](https://arxiv.org/html/2605.24052v1/x14.png)

Figure 16: Win-rates after fine-tuning GPT-2 with our mechanism in Definition 5 and benchmark 3 (median) at time round $T=20$ versus GPT-2 model size under limited SU feedback, for worker numbers $N\in\{5,15,25\}$.

## References

- \[1\]S\. Hao and L\. Duan, “Online learning from strategic human feedback in llm fine\-tuning,” in*ICASSP 2025\-2025 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)*\. IEEE, 2025, pp\. 1–5\.
- \[2\]H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar*et al\.*, “LLaMA: Open and efficient foundation language models,”*arXiv preprint arXiv:2302\.13971*, 2023\.
- \[3\]M\. Xu, D\. Niyato, H\. Zhang, J\. Kang, Z\. Xiong, S\. Mao, and Z\. Han, “Cached model\-as\-a\-resource: Provisioning large language model agents for edge intelligence in space\-air\-ground integrated networks,”*IEEE Transactions on Networking*, 2025\.
- \[4\]“Waze,”https://www\.waze\.com, accessed 2025\.
- \[5\]Z\. He, Y\. Liu, and F\. R\. Yu, “Crowdsensing\-based urban traffic monitoring using mobile social media,”*IEEE Communications Magazine*, 2016\.
- \[6\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray*et al\.*, “Training language models to follow instructions with human feedback,”*Advances in neural information processing systems*, vol\. 35, pp\. 27 730–27 744, 2022\.
- \[7\]“Gemini apps privacy hub,”https://support\.google\.com/gemini/answer/13594961, accessed 2025\.
- \[8\]H\. Sun, Y\. Chen, S\. Wang, W\. Chen, and X\. Deng, “Mechanism design for llm fine\-tuning with multiple reward models,”*arXiv preprint arXiv:2405\.16276*, 2024\.
- \[9\]E\. Soumalias, M\. J\. Curry, and S\. Seuken, “Truthful aggregation of llms with an application to online advertising,”*arXiv preprint arXiv:2405\.05905*, 2024\.
- \[10\]C\. Park, M\. Liu, D\. Kong, K\. Zhang, and A\. E\. Ozdaglar, “Rlhf from heterogeneous feedback via personalization and preference aggregation,” in*ICML 2024 Workshop on Theoretical Foundations of Foundation Models*, 2024\.
- \[11\]V\. Conitzer, R\. Freedman, J\. Heitzig, W\. H\. Holliday, B\. M\. Jacobs, N\. Lambert, M\. Mossé, E\. Pacuit, S\. Russell, H\. Schoelkopf*et al\.*, “Social choice for ai alignment: Dealing with diverse human feedback,”*arXiv preprint arXiv:2404\.10271*, 2024\.
- \[12\]T\. Roughgarden and O\. Schrijvers, “Online prediction with selfish experts,”*Advances in Neural Information Processing Systems*, vol\. 30, 2017\.
- \[13\]J\. J\. Chandler and G\. Paolacci, “Lie for a dime: When most prescreening responses are honest but most study participants are impostors,”*Social Psychological and Personality Science*, vol\. 8, no\. 5, pp\. 500–508, 2017\.
- \[14\]R\. Kennedy, S\. Clifford, T\. Burleigh, P\. D\. Waggoner, R\. Jewell, and N\. J\. Winter, “The shape of and solutions to the mturk quality crisis,”*Political Science Research and Methods*, vol\. 8, no\. 4, pp\. 614–629, 2020\.
- \[15\]J\. Perez, J\. Via, L\. Vielva, and D\. Ramirez, “Online detection and snr estimation in cooperative spectrum sensing,”*IEEE Transactions on Wireless Communications*, vol\. 21, no\. 4, pp\. 2521–2533, 2021\.
- \[16\]J\. Perez, I\. Santamaria, and J\. Via, “Adaptive em\-based algorithm for cooperative spectrum sensing in mobile environments,” in*2018 IEEE Statistical Signal Processing Workshop \(SSP\)*\. IEEE, 2018, pp\. 732–736\.
- \[17\]Z\. Fan, A\. Maiti, L\. J\. Ratliff, K\. Jamieson, and G\. Farina, “On the universal near optimality of hedge in combinatorial settings,” in*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.
- \[18\]H\. Zhou, Z\. Xu, and V\. Tzoumas, “Efficient online learning with memory via frank\-wolfe optimization: Algorithms with bounded dynamic regret and applications to control,” in*2023 62nd IEEE Conference on Decision and Control \(CDC\)*\. IEEE, 2023, pp\. 8266–8273\.
- \[19\]K\. A\. Dubey, Z\. Feng, R\. Kidambi, A\. Mehta, and D\. Wang, “Auctions with llm summaries,”*arXiv preprint arXiv:2404\.08126*, 2024\.
- \[20\]M\. Xu, D\. Niyato, B\. Wright, H\. Zhang, J\. Kang, Z\. Xiong, S\. Mao, and Z\. Han, “Epvisa: Efficient auction design for real\-time physical\-virtual synchronization in the human\-centric metaverse,”*IEEE Journal on Selected Areas in Communications*, vol\. 42, no\. 3, pp\. 694–709, 2023\.
- \[21\]H\. Dong, W\. Xiong, B\. Pang, H\. Wang, H\. Zhao, Y\. Zhou, N\. Jiang, D\. Sahoo, C\. Xiong, and T\. Zhang, “Rlhf workflow: From reward modeling to online rlhf,”*Transactions on Machine Learning Research*\.
- \[22\]C\. Ye, W\. Xiong, Y\. Zhang, H\. Dong, N\. Jiang, and T\. Zhang, “Online iterative reinforcement learning from human feedback with general preference model,”*Advances in Neural Information Processing Systems*, vol\. 37, pp\. 81 773–81 807, 2024\.
- \[23\]W\. Xiong, H\. Dong, C\. Ye, Z\. Wang, H\. Zhong, H\. Ji, N\. Jiang, and T\. Zhang, “Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl\-constraint,” in*Forty\-first International Conference on Machine Learning*, 2024\.
- \[24\]M\. Asadi, A\. Bellet, O\.\-A\. Maillard, and M\. Tommasi, “Collaborative algorithms for online personalized mean estimation,”*Transactions on Machine Learning Research Journal*, 2022\.
- \[25\]Y\. Chen, J\. Zhu, and K\. Kandasamy, “Mechanism design for collaborative normal mean estimation,”*Advances in Neural Information Processing Systems*, vol\. 36, 2024\.
- \[26\]J\. Li, M\. Li, and H\. Chan, “Strategyproof mechanisms for group\-fair obnoxious facility location problems,” in*Proceedings of the AAAI Conference on Artificial Intelligence*, vol\. 38, no\. 9, 2024, pp\. 9832–9839\.
- \[27\]Y\. Wang, H\. Zhou, and M\. Li, “Positive intra\-group externalities in facility location,” in*Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems*, 2024, pp\. 1883–1891\.
- \[28\]N\. Lambert, “Reinforcement learning from human feedback,”*arXiv preprint arXiv:2504\.12501*, 2025\.
- \[29\]H\. Moulin, “On strategy\-proofness and single peakedness,”*Public Choice*, vol\. 35, no\. 4, pp\. 437–455, 1980\.
- \[30\]D\. Prelec, “A Bayesian truth serum for subjective data,”*Science*, vol\. 306, no\. 5695, pp\. 462–466, 2004\.
- \[31\]N\. Miller, P\. Resnick, and R\. Zeckhauser, “Eliciting informative feedback: The peer\-prediction method,”*Management Science*, vol\. 51, no\. 9, pp\. 1359–1373, 2005\.
- \[32\]Y\. Liu and Y\. Chen, “Machine\-learning aided peer prediction,” in*Proceedings of the 2017 ACM Conference on Economics and Computation*, 2017, pp\. 63–80\.
- \[33\]J\. Witkowski, R\. Freeman, J\. W\. Vaughan, D\. M\. Pennock, and A\. Krause, “Incentive\-compatible forecasting competitions,”*Management Science*, vol\. 69, no\. 3, pp\. 1354–1374, 2023\.
- \[34\]A\. P\. Dawid and A\. M\. Skene, “Maximum likelihood estimation of observer error\-rates using the EM algorithm,”*Journal of the Royal Statistical Society: Series C \(Applied Statistics\)*, vol\. 28, no\. 1, pp\. 20–28, 1979\.
- \[35\]I\. A\. Kash, L\. Reyzin, and Z\. Yu, “Slowly changing adversarial bandit algorithms are efficient for discounted mdps,” in*International Conference on Algorithmic Learning Theory*\. PMLR, 2024, pp\. 683–718\.
- \[36\]M\. Khodak, I\. Osadchiy, K\. Harris, M\.\-F\. F\. Balcan, K\. Y\. Levy, R\. Meir, and S\. Z\. Wu, “Meta\-learning adversarial bandit algorithms,”*Advances in Neural Information Processing Systems*, vol\. 36, pp\. 35 441–35 471, 2023\.
- \[37\]N\. B\. Shah and D\. Zhou, “Double or nothing: Multiplicative incentive mechanisms for crowdsourcing,”*Journal of Machine Learning Research*, vol\. 17, no\. 165, pp\. 1–52, 2016\.
- \[38\]R\. Freeman, D\. Pennock, C\. Podimata, and J\. W\. Vaughan, “No\-regret and incentive\-compatible online learning,” in*International Conference on Machine Learning*\. PMLR, 2020, pp\. 3270–3279\.
- \[39\]Z\. Wei, C\. Li, T\. Ren, H\. Xu, and H\. Wang, “Incentivized truthful communication for federated bandits,” in*Proceedings of the 12th International Conference on Learning Representations \(ICLR 2024\)*\.
- \[40\]M\. M\. Karim, Y\. Shi, S\. Zhang, B\. Wang, M\. Nasri, and Y\. Wang, “Large language models and their applications in roadway safety and mobility enhancement: A comprehensive review,”*arXiv preprint arXiv:2506\.06301*, 2025\.
- \[41\]K\. Alsheeb and M\. Brandão, “Towards explainable road navigation systems,” in*2023 IEEE 26th International Conference on Intelligent Transportation Systems \(ITSC\)*\. IEEE, 2023, pp\. 16–22\.
- \[42\]M\. A\. Aygül, H\. A\. Çırpan, and H\. Arslan, “Machine learning\-based spectrum occupancy prediction: A comprehensive survey,”*Frontiers in Communications and Networks*, vol\. 6, p\. 1482698, 2025\.
- \[43\]Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan*et al\.*, “Training a helpful and harmless assistant with reinforcement learning from human feedback,”*arXiv preprint arXiv:2204\.05862*, 2022\.
- \[44\]H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale*et al\.*, “Llama 2: Open foundation and fine\-tuned chat models,”*arXiv preprint arXiv:2307\.09288*, 2023\.
- \[45\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn, “Direct preference optimization: Your language model is secretly a reward model,”*Advances in Neural Information Processing Systems*, vol\. 36, 2024\.
- \[46\]P\. F\. Christiano, J\. Leike, T\. Brown, M\. Martic, S\. Legg, and D\. Amodei, “Deep reinforcement learning from human preferences,”*Advances in neural information processing systems*, vol\. 30, 2017\.
- \[47\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen, “Lora: Low\-rank adaptation of large language models,” in*International Conference on Learning Representations \(ICLR\)*, 2022\.
- \[48\]C\. Chen, K\. Petty, A\. Skabardonis, P\. Varaiya, and Z\. Jia, “Freeway performance measurement system: Mining loop detector data,”*Transportation Research Record*, vol\. 1748, no\. 1, pp\. 96–102, 2001\.
- \[49\]S\.\-H\. Li, Y\. Zhang, A\. Nosratinia, and J\. Yuan, “SHARP: Spectrum harvesting with ARQ retransmission and probing in cognitive radio,”*IEEE Transactions on Communications*, vol\. 61, no\. 3, pp\. 886–897, March 2013\.
- \[50\]K\. Liu and Q\. Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,”*IEEE Transactions on Information Theory*, vol\. 56, no\. 11, pp\. 5547–5567, 2010\.
- \[51\]Q\. Zhao, L\. Tong, and A\. Swami, “Decentralized cognitive mac for dynamic spectrum access,” in*First IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, 2005\. DySPAN 2005\.*IEEE, 2005, pp\. 224–232\.
- \[52\]Federal Communications Commission, “Amendment of the Commission’s Rules with Regard to Commercial Operations in the 3550\-3650 MHz Band,” FCC 15\-47, Report and Order and Second Further Notice of Proposed Rulemaking, Apr\. 2015, GN Docket No\. 12\-354\.
- \[53\]P\. Marshall,*Three\-Tier Shared Spectrum, Shared Infrastructure, and a Path to 5G*\. Cambridge, UK: Cambridge University Press, 2017\.
- \[54\]Wireless Innovation Forum, “Signaling Protocols and Procedures for Citizens Broadband Radio Service \(CBRS\): Spectrum Access System \(SAS\) \- Citizens Broadband Radio Service Device \(CBSD\) Interface Technical Specification,” Wireless Innovation Forum, Tech\. Rep\. WINNF\-TS\-0016\-V1\.2\.1, 2017\.
- \[55\]Y\. Zhang and M\. Van der Schaar, “Reputation\-based incentive protocols in crowdsourcing applications,” in*2012 Proceedings IEEE INFOCOM*\. IEEE, 2012, pp\. 2140–2148\.
- \[56\]H\. Xie, J\. C\. Lui, and D\. Towsley, “Design and analysis of incentive and reputation mechanisms for online crowdsourcing systems,”*ACM Transactions on Modeling and Performance Evaluation of Computing Systems \(TOMPECS\)*, vol\. 1, no\. 3, pp\. 1–27, 2016\.
- \[57\]C\. Wu, Y\. Zhu, R\. Zhang, Y\. Chen, F\. Wang, and S\. Cui, “FedAB: Truthful federated learning with auction\-based combinatorial multi\-armed bandit,” vol\. 10, no\. 17, pp\. 15 159–15 170\.
- \[58\]Y\. Zhao, X\. Gong, and S\. Mao, “Truthful incentive mechanism for federated learning with crowdsourced data labeling,” in*Proceedings of IEEE INFOCOM 2023*, pp\. 1–10\.
- \[59\]D\. Uvaydov, S\. D’Oro, F\. Restuccia, and T\. Melodia, “Deepsense: Fast wideband spectrum sensing through real\-time in\-the\-loop deep learning,” in*IEEE INFOCOM 2021\-IEEE Conference on Computer Communications*\. IEEE, 2021, pp\. 1–10\.
- \[60\]L\. Liu and J\. Etesami, “Online mixture of experts: No\-regret learning for optimal collective decision\-making,” in*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2025\.

![[Uncaptioned image]](https://arxiv.org/html/2605.24052v1/Bio1.png)

Shugang Hao (M’22) received the Ph.D. degree from Singapore University of Technology and Design (SUTD) in 2022. He has been a postdoctoral research fellow at SUTD since Sep. 2022. His research interests are LLM for networks, networking for LLM, LLM fine-tuning, game theory and mechanism design. He served as the web chair of ACM SenSys 2024, the local arrangement chair of AIoTSys 2024 and the local arrangement chair of IEEE WiOpt 2023.

![[Uncaptioned image]](https://arxiv.org/html/2605.24052v1/Bio2.jpg)

Lingjie Duan (S’09-M’12-SM’17) received the Ph.D. degree from The Chinese University of Hong Kong in 2012. He is a full Professor in the Internet of Things Thrust and the Artificial Intelligence Thrust at Hong Kong University of Science and Technology, Guangzhou. He was an Associate Professor and Associate Head of Pillar of Engineering Systems and Design with the Singapore University of Technology and Design (SUTD). In 2011, he was a Visiting Scholar at University of California at Berkeley, Berkeley, CA, USA. His research interests include network economics and game theory, cognitive communications, and cooperative networking. He is an Associate Editor of IEEE Transactions on Mobile Computing and IEEE Transactions on Networking. He was an Editor of IEEE Transactions on Wireless Communications and IEEE Communications Surveys and Tutorials. He also served as a Guest Editor of the IEEE Journal on Selected Areas in Communications Special Issue on Human-in-the-Loop Mobile Networks, as well as IEEE Wireless Communications Magazine. He received the SUTD Excellence in Research Award in 2016 and the 10th IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award in 2015. He served as the general chair of IEEE WiOpt 2023.

## Appendix A Proof of Lemma [1](https://arxiv.org/html/2605.24052#Thmlemma1)

First, we prove that Benchmark 1 (EM-based Weight Estimation) is not truthful. We construct a counterexample with horizon $T=2$ showing that truthful reporting is not a dominant strategy. The total utility for worker $k$ is

$$u_{k}=\sum_{t=1}^{2}w_{k}^{t}=w_{k}^{1}+w_{k}^{2}.$$

Since the initial weights are uniform ($w_{i}^{1}=w_{0},\ \forall i$) and independent of reports, maximizing $u_{k}$ is equivalent to maximizing the expected weight at the next time step, i.e., $\mathbb{E}[w_{k}^{2}]$.

Consider time $t=1$ and a worker $k$ with private belief $q:=\Pr(p^{1}=1)\in(0,1)$ about the hidden binary outcome $p^{1}\in\{0,1\}$. We compare two strategies:

- Strategy $S_{\text{truth}}$: report truthfully by sampling from the belief, i.e., $\hat{y}_{k}^{1}\sim\mathrm{Bernoulli}(q)$ (so $\Pr(\hat{y}_{k}^{1}=1)=q$).
- Strategy $S_{\text{lie}}$: report a constant extreme to align with the majority, $\hat{y}_{k}^{1}=1$ (*Always High*).

We instantiate the EM benchmark with a standard Dawid–Skene likelihood. Conditional on the latent truth $p^{1}$, each worker $i$ reports $\hat{y}_{i}^{1}\in\{0,1\}$ correctly with probability $w_{i}\in(0,1)$:

$$\Pr(\hat{y}_{i}^{1}=p^{1}\mid w_{i})=w_{i},\qquad\Pr(\hat{y}_{i}^{1}\neq p^{1}\mid w_{i})=1-w_{i}.$$

We place an independent Beta prior $w_{i}\sim\mathrm{Beta}(\alpha,\beta)$ with fixed $\alpha,\beta>0$. The prior on the truth is $\Pr(p^{1}=1)=\pi\in(0,1)$; for simplicity take $\pi=\frac{1}{2}$. Assume all initial weights are identical and satisfy the standard reliability assumption:

$$w_{i}^{(1)}=w_{i}^{1}=w_{0}\in\Big(\frac{1}{2},1\Big)\qquad\forall i.$$

Given current parameter estimates $\mathbf{w}^{(1)}$, the E-step computes the posterior

$$\gamma:=\Pr(p^{1}=1\mid\hat{\mathbf{y}}^{1},\mathbf{w}^{(1)})=\frac{\pi\prod_{i=1}^{N}\Pr(\hat{y}_{i}^{1}\mid p^{1}=1,w_{0})}{\pi\prod_{i=1}^{N}\Pr(\hat{y}_{i}^{1}\mid p^{1}=1,w_{0})+(1-\pi)\prod_{i=1}^{N}\Pr(\hat{y}_{i}^{1}\mid p^{1}=0,w_{0})},$$

where $\hat{\mathbf{y}}^{1}=(\hat{y}_{1}^{1},\dots,\hat{y}_{N}^{1})$. The M-step updates each worker’s reliability by the posterior-mean rule

$$w_{i}^{2}=\frac{\alpha+\mathbb{E}[\mathbf{1}\{\hat{y}_{i}^{1}=p^{1}\}\mid\hat{\mathbf{y}}^{1},\mathbf{w}^{(1)}]}{\alpha+\beta+1},$$

where

$$\mathbb{E}[\mathbf{1}\{\hat{y}_{i}^{1}=p^{1}\}\mid\hat{\mathbf{y}}^{1},\mathbf{w}^{(1)}]=\gamma\,\mathbf{1}\{\hat{y}_{i}^{1}=1\}+(1-\gamma)\,\mathbf{1}\{\hat{y}_{i}^{1}=0\}.$$
Assume the other $N-1$ workers all report $\hat{y}_{i}^{1}=1$ at time $t=1$. Define

$$\gamma_{1}:=\Pr(p^{1}=1\mid\hat{y}_{k}^{1}=1,\ \hat{y}_{-k}^{1}=\mathbf{1},\ \mathbf{w}^{(1)}),\qquad\gamma_{0}:=\Pr(p^{1}=1\mid\hat{y}_{k}^{1}=0,\ \hat{y}_{-k}^{1}=\mathbf{1},\ \mathbf{w}^{(1)}).$$

Because all workers share the same $w_{0}$, the likelihood depends only on the number of ones. Let $m$ be the number of ones among the $N$ reports at $t=1$. Under Dawid–Skene,

$$\frac{\Pr(\hat{\mathbf{y}}^{1}\mid p^{1}=1,w_{0})}{\Pr(\hat{\mathbf{y}}^{1}\mid p^{1}=0,w_{0})}=\left(\frac{w_{0}}{1-w_{0}}\right)^{2m-N}.$$

With $\pi=\frac{1}{2}$, Bayes’ rule gives

$$\gamma(m)=\Pr(p^{1}=1\mid m)=\frac{1}{1+\left(\frac{1-w_{0}}{w_{0}}\right)^{2m-N}}.$$

In our environment,

$$m=\begin{cases}N,&\text{if }\hat{y}_{k}^{1}=1,\\ N-1,&\text{if }\hat{y}_{k}^{1}=0,\end{cases}$$

so $\gamma_{1}=\gamma(N)$ and $\gamma_{0}=\gamma(N-1)$. Since $w_{0}>\frac{1}{2}$ implies $\frac{1-w_{0}}{w_{0}}<1$, and the exponents $N$ and $N-2$ are both positive for $N\geq 3$, we have

$$\gamma_{1}>\gamma_{0}>\frac{1}{2},\qquad\gamma_{0}+\gamma_{1}>1.$$
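The closed form for $\gamma(m)$ and the resulting ordering of $\gamma_{1}$ and $\gamma_{0}$ can be checked numerically. The sketch below is illustrative: the function name `posterior_truth_is_one` and the values $N=5$, $w_{0}=0.7$ are assumptions chosen for the example, not values from the paper.

```python
def posterior_truth_is_one(num_ones, num_workers, w0):
    """gamma(m) = 1 / (1 + ((1 - w0) / w0) ** (2m - N)) under a uniform prior."""
    if not 0.5 < w0 < 1.0:
        raise ValueError("the reliability assumption requires 1/2 < w0 < 1")
    if not 0 <= num_ones <= num_workers:
        raise ValueError("num_ones must lie between 0 and num_workers")
    ratio = (1.0 - w0) / w0
    return 1.0 / (1.0 + ratio ** (2 * num_ones - num_workers))


# Illustrative values (assumptions): N = 5 workers, common reliability w0 = 0.7.
N, w0 = 5, 0.7
gamma_1 = posterior_truth_is_one(N, N, w0)      # worker k also reports 1
gamma_0 = posterior_truth_is_one(N - 1, N, w0)  # worker k reports 0

print(gamma_1, gamma_0)
```

For these values both posteriors exceed one half and $\gamma_{1}>\gamma_{0}$, in line with the inequalities above: a single dissenting report barely moves the posterior when the other reports agree.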
Now compare the expected updated weight for worker $k$.

1\) If $k$ lies (i.e., $\hat{y}_{k}^{1}=1$), then $\mathbf{1}\{\hat{y}_{k}^{1}=1\}=1$, so

$$\mathbb{E}[\mathbf{1}\{\hat{y}_{k}^{1}=p^{1}\}\mid\hat{\mathbf{y}}^{1},\mathbf{w}^{(1)}]=\gamma_{1},$$

and hence

$$w_{k,\text{lie}}^{2}=\frac{\alpha+\gamma_{1}}{\alpha+\beta+1}.$$

2\) If $k$ is truthful (i.e., $\hat{y}_{k}^{1}\sim\mathrm{Bernoulli}(q)$), then conditioning on $\hat{y}_{k}^{1}$:

$$w_{k,\text{truth}}^{2}=\begin{cases}\frac{\alpha+\gamma_{1}}{\alpha+\beta+1},&\text{if }\hat{y}_{k}^{1}=1,\\[6.0pt] \frac{\alpha+(1-\gamma_{0})}{\alpha+\beta+1},&\text{if }\hat{y}_{k}^{1}=0,\end{cases}$$

so taking expectation over $\hat{y}_{k}^{1}$ gives

$$\mathbb{E}[w_{k,\text{truth}}^{2}]=q\cdot\frac{\alpha+\gamma_{1}}{\alpha+\beta+1}+(1-q)\cdot\frac{\alpha+1-\gamma_{0}}{\alpha+\beta+1}.$$

Therefore,

$$\mathbb{E}[w_{k,\text{lie}}^{2}]-\mathbb{E}[w_{k,\text{truth}}^{2}]=(1-q)\cdot\frac{\gamma_{1}+\gamma_{0}-1}{\alpha+\beta+1}.$$

Since $\gamma_{0}+\gamma_{1}>1$, the numerator is positive. For any belief $q<1$, the difference is strictly positive. It follows that

$$\mathbb{E}[w_{k,\text{lie}}^{2}]>\mathbb{E}[w_{k,\text{truth}}^{2}].$$

Thus $u_{k}(\text{lie})>u_{k}(\text{truth})$, so truthful reporting is not a dominant strategy. This proves that Benchmark 1 is not truthful.
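The gap between the two expected weights can be verified numerically against the closed form $(1-q)\cdot\frac{\gamma_{1}+\gamma_{0}-1}{\alpha+\beta+1}$. The following sketch uses hypothetical helper names (`updated_weight`, `expected_weights`) and illustrative inputs ($\gamma_{1}=0.98$, $\gamma_{0}=0.92$, $q=0.6$, $\alpha=\beta=1$) that are assumptions rather than values from the paper.

```python
def updated_weight(report, gamma, alpha, beta):
    """Posterior-mean M-step: (alpha + E[1{report == truth}]) / (alpha + beta + 1).

    gamma is the posterior probability that the truth equals 1 given all reports.
    """
    expected_correct = gamma if report == 1 else 1.0 - gamma
    return (alpha + expected_correct) / (alpha + beta + 1.0)


def expected_weights(q, gamma_1, gamma_0, alpha, beta):
    """Return (E[w_lie], E[w_truth]) when all other workers report 1."""
    lie = updated_weight(1, gamma_1, alpha, beta)
    truth = (q * updated_weight(1, gamma_1, alpha, beta)
             + (1.0 - q) * updated_weight(0, gamma_0, alpha, beta))
    return lie, truth


# Illustrative inputs (assumptions, not values from the paper).
q, g1, g0, a, b = 0.6, 0.98, 0.92, 1.0, 1.0
lie, truth = expected_weights(q, g1, g0, a, b)
closed_form_gap = (1.0 - q) * (g1 + g0 - 1.0) / (a + b + 1.0)

print(lie - truth, closed_form_gap)
```

With these inputs both expressions give a gap of about 0.12 in favor of always reporting 1, and the gap vanishes only as $q\to 1$, when the truthful worker would report 1 anyway.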

Next, we prove that Benchmark 1 leads to a non-vanishing regret. Fix any horizon $T$ and consider the following construction. There exists one worker $o\in[N]$ such that

$$\mathcal{P}_{o}(y_{l_{j}}^{t}>y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})=p_{j}^{t}\quad\forall j\in[m_{t}],t\in[T],$$

i.e., worker $o$ is always correct. For every other worker $i\neq o$, assume they always report the opposite signal:

$$\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}>y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})=1-p_{j}^{t}\quad\forall i\neq o,j\in[m_{t}],t\in[T].$$

Assuming $N\geq 3$, the $N-1$ adversarial workers form a majority. Under the EM instantiation with $w_{0}>0.5$, due to the existence of multiple stationary points of the likelihood function and the dependence of EM on initialization and sample realizations, there exists a realization (and corresponding EM trajectory) under which the algorithm reinforces the majority cluster because its members are statistically consistent with each other. Consequently, the inferred latent variable converges to the majority opinion:

$$\hat{\mathcal{P}}(y_{l_{j}}^{t}>y_{l_{j}^{\prime}}^{t}\mid x_{j}^{t})=1-p_{j}^{t}.$$

Meanwhile, the best fixed worker in hindsight is $i^{*}=o$, which yields

$$\min_{i\in[N]}\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}(\mathcal{P}_{i}-p_{j}^{t})^{2}=0.$$

However, the cumulative aggregation loss over $T$ slots is

$$\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}(\hat{\mathcal{P}}-p_{j}^{t})^{2}=T.$$

Finally, the regret of the EM-based scheme is

$$R_{1}(T)=T-0=\mathcal{O}(T),$$

which implies $\lim_{T\to\infty}\frac{R_{1}(T)}{T}>0$. This completes the proof.

## Appendix B Proof of Lemma [2](https://arxiv.org/html/2605.24052#Thmlemma2)

First, we prove that the Hedge scheme is not truthful. We want to show that for any number of workers $N\geq 2$ and time horizon $T\geq 2$, there exists a belief $q\in(0,1)\setminus\{\tfrac{1}{2}\}$ and a strategy profile of the other workers such that worker $i$ achieves a strictly higher expected long-term utility by misreporting in the first time slot. Assume $m_{t}=1$ for all $t\in[T]$.

Let worker $i$ have private belief $q=\Pr(p^{1}=1)$ with $q\notin\{0,\tfrac{1}{2},1\}$. The worker's long-term utility is

$$u_{i}=\sum_{t=1}^{T}w_{i}^{t}.$$

The Hedge weight update rule is

$$w_{i}^{t+1}=w_{i}^{t}\cdot\exp\!\left(-\eta\big(\hat{P}_{i}^{t}-p^{t}\big)^{2}\right),\qquad w_{i}^{1}=1,$$

where $\eta>0$ is the learning rate.

Due to the multiplicative structure, for any $t\geq 2$,

$$w_{i}^{t}=w_{i}^{2}\cdot\prod_{\tau=2}^{t-1}\exp\!\left(-\eta\big(\hat{P}_{i}^{\tau}-p^{\tau}\big)^{2}\right).$$

Define the future weight multiplier

$$M_{i}:=\sum_{t=2}^{T}\prod_{\tau=2}^{t-1}\exp\!\left(-\eta\big(\hat{P}_{i}^{\tau}-p^{\tau}\big)^{2}\right),$$

where the empty product for $t=2$ equals $1$, so that $M_{i}\geq 1$. Since $\sum_{t=2}^{T}w_{i}^{t}=w_{i}^{2}\cdot M_{i}$ and $w_{i}^{1}=1$, the utility can be written as

$$u_{i}=1+w_{i}^{2}\cdot M_{i}.$$

Consider two reporting strategies for worker $i$ in slot $t=1$:

- Truthful strategy $S_{T}$: report $\hat{P}_{i}^{1}=q$.
- Deviating strategy $S_{D}$: report $\hat{P}_{i}^{1}=r\neq q$, where $r$ is chosen to maximize $\mathbb{E}[w_{i}^{2}]$.

Because $M_{i}$ depends only on slots $t\geq 2$, it is treated as independent of the first-slot weight $w_{i}^{2}$, and we then have

$$\mathbb{E}[u_{i}\mid S]=1+\mathbb{E}[w_{i}^{2}\mid S]\cdot\mathbb{E}[M_{i}],\qquad S\in\{S_{T},S_{D}\}.$$

We analyze $\mathbb{E}[w_{i}^{2}]$ as a function of the first report $r$. Since $p^{1}\in\{0,1\}$ with $\Pr(p^{1}=1)=q$, we have

$$\mathbb{E}[w_{i}^{2}]=q\exp\!\left(-\eta(r-1)^{2}\right)+(1-q)\exp\!\left(-\eta r^{2}\right)=:F(r).$$

Its derivative is

$$F^{\prime}(r)=-2\eta\left[q(r-1)e^{-\eta(r-1)^{2}}+(1-q)re^{-\eta r^{2}}\right].$$

Evaluating at $r=q$ yields

$$F^{\prime}(q)=-2\eta\,q(1-q)\left(e^{-\eta q^{2}}-e^{-\eta(1-q)^{2}}\right).$$

If $q>\tfrac{1}{2}$, then $q^{2}>(1-q)^{2}$ and thus $e^{-\eta q^{2}}<e^{-\eta(1-q)^{2}}$, implying $F^{\prime}(q)>0$. If $q<\tfrac{1}{2}$, then $q^{2}<(1-q)^{2}$ and thus $e^{-\eta q^{2}}>e^{-\eta(1-q)^{2}}$, implying $F^{\prime}(q)<0$. Therefore, for $q\neq\tfrac{1}{2}$, we have $F^{\prime}(q)\neq 0$, so the truthful report $r=q$ is not a local maximizer of $F(r)$; hence there exists $r^{\star}\neq q$ such that

$$\mathbb{E}[w_{i}^{2}\mid S_{D}]>\mathbb{E}[w_{i}^{2}\mid S_{T}].$$

Since $\mathbb{E}[M_{i}]>0$, it follows that $\mathbb{E}[u_{i}\mid S_{D}]>\mathbb{E}[u_{i}\mid S_{T}]$, proving that the Hedge scheme is not truthful.
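The sign argument for $F^{\prime}(q)$ can be illustrated with a small grid search. This is a sketch under assumed values ($q=0.7$, $\eta=1$, and a uniform grid on $[0,1]$, none of which come from the paper); `F`, `dF` and `best_report` are hypothetical helper names.

```python
import math


def F(r, q, eta):
    """Expected second-slot Hedge weight when reporting r under belief q."""
    return q * math.exp(-eta * (r - 1.0) ** 2) + (1.0 - q) * math.exp(-eta * r ** 2)


def dF(r, q, eta):
    """Derivative of F with respect to the report r."""
    return -2.0 * eta * (
        q * (r - 1.0) * math.exp(-eta * (r - 1.0) ** 2)
        + (1.0 - q) * r * math.exp(-eta * r ** 2)
    )


def best_report(q, eta, grid=10001):
    """Grid search for the report in [0, 1] that maximizes F."""
    candidates = [k / (grid - 1) for k in range(grid)]
    return max(candidates, key=lambda r: F(r, q, eta))


q, eta = 0.7, 1.0  # illustrative values only
r_star = best_report(q, eta)
print(dF(q, q, eta), r_star, F(r_star, q, eta) - F(q, q, eta))
```

With these assumed values the derivative at the truthful report is positive and the maximizing report lies above $q$, so a worker who believes $q>\tfrac{1}{2}$ gains by exaggerating toward 1.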

## Appendix C Proof of Lemma [3](https://arxiv.org/html/2605.24052#Thmlemma3)

We want to prove $R_{3}(T)=\mathcal{O}(T)$ with a possible sequence of workers' preferences. In particular, we consider that $\mathcal{P}_{o}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})=p_{j}^{t}$ holds for one particular $o\in[N]$ with any $j\in[m_{t}]$ and $t\in[T]$. Further, we consider $(\hat{\mathcal{P}}_{j,k_{m}}^{t}-p_{j}^{t})^{2}=c_{j}^{t}$ for $j\in[m_{t}]$ and $t\in[T]$, where $\hat{\mathcal{P}}_{j,k_{m}}^{t}$ denotes the median of workers' feedback $\{\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\}_{i=1}^{N}$ and $c_{j}^{t}\in[\frac{1}{2},1]$. Accordingly, the best fixed worker in hindsight is $i^{*}=o$, which gives

$$\min_{i\in[N]}\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-p_{j}^{t}\bigg)^{2}=0.$$

However, with the platform's median scheme, the cumulative aggregation loss over $T$ slots is

$$\begin{aligned}
&\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg(\sum_{i=1}^{N}\frac{w_{i}^{t}\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}-p_{j}^{t}\bigg)^{2}\\
&=\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg(\hat{\mathcal{P}}_{j,k_{m}}^{t}-p_{j}^{t}\bigg)^{2}\\
&=\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}c_{j}^{t}=\mathcal{O}(T),
\end{aligned}$$

where the last equality holds because each $c_{j}^{t}\in[\frac{1}{2},1]$, so $\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}c_{j}^{t}$ lies between $\frac{T}{2}$ and $T$ and does not vanish relative to $T$ as $T\to\infty$. Finally, the regret of the median scheme is

$$\begin{aligned}
R_{3}(T)&=\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg(\sum_{i=1}^{N}\frac{w_{i}^{t}\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}-p_{j}^{t}\bigg)^{2}-\min_{i\in[N]}\sum_{t=1}^{T}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-p_{j}^{t}\bigg)^{2}\\
&=\mathcal{O}(T).
\end{aligned}$$

This completes the proof.
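A concrete instance of this construction can be simulated in a few lines. The sketch below is illustrative only: it assumes $N=3$ workers, $m_{t}=1$, an alternating ground truth in $\{0,1\}$, one always-correct worker and two workers reporting the complement, which gives $c_{j}^{t}=1$ in every slot. The function name `median_scheme_regret` is hypothetical.

```python
from statistics import median


def median_scheme_regret(horizon, n_workers=3):
    """Regret of median aggregation against the best fixed worker.

    Worker 0 always reports the ground truth; every other worker reports
    its complement (an assumed adversarial profile, one item per slot).
    """
    aggregate_loss = 0.0
    worker_loss = [0.0] * n_workers
    for t in range(horizon):
        p = float(t % 2)  # assumed alternating ground truth in {0, 1}
        reports = [p] + [1.0 - p] * (n_workers - 1)
        aggregate = median(reports)  # median follows the wrong majority
        aggregate_loss += (aggregate - p) ** 2
        for i, report in enumerate(reports):
            worker_loss[i] += (report - p) ** 2
    return aggregate_loss - min(worker_loss)


for horizon in (10, 100, 1000):
    print(horizon, median_scheme_regret(horizon))
```

In this assumed instance the best fixed worker has zero loss while the median incurs loss 1 per slot, so the regret equals the horizon and grows linearly.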

## Appendix D Proof of Proposition [1](https://arxiv.org/html/2605.24052#Thmproposition1)

Since $g_{i}(\cdot)$ is strictly increasing, maximizing the utility $u_{i}=g_{i}\!\left(\mathbb{E}\!\left[\sum_{t=1}^{T}w_{i}^{t}\right]\right)$ is equivalent to maximizing the expected cumulative weight $\mathbb{E}\!\left[\sum_{t=1}^{T}w_{i}^{t}\right]$. According to our system model in Section [III](https://arxiv.org/html/2605.24052#S3), each worker believes that $p_{j}^{t}\sim\mathrm{Bernoulli}(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t}))$, so the expectation of $w_{i}^{t+1}$ in ([6](https://arxiv.org/html/2605.24052#S5.E6)) over $p_{j}^{t}$ is

$$\begin{aligned}
\mathbb{E}[w_{i}^{t+1}]
&=w_{i}^{t}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg[1-\alpha\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-1\big)^{2}\\
&\qquad\qquad-\alpha\big(1-\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big)\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-0\big)^{2}\bigg]\\
&=w_{i}^{t}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg[1-\alpha\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big)^{2}\\
&\qquad\qquad-\alpha\big(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-\mathcal{P}_{i}^{2}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big)\bigg],
\end{aligned}$$

which is maximized at $\hat{\mathcal{P}}_{i}^{*}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})=\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})$, because only the first squared term depends on the report.
To obtain the largest possible cumulative weight, each worker will truthfully report his preference in the first time slot and in all following time slots, because any deviation leads to smaller weights in the next and all following time slots. This completes the proof.
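The key step is that the quadratic penalty is a proper scoring rule: its expectation splits into a term that depends on the report and a term that does not. The sketch below checks this for a single item, assuming a per-item multiplier of the form $1-\alpha(\hat{\mathcal{P}}-p)^{2}$ as the expectation above suggests; the values `belief = 0.35` and `alpha = 0.2` and both function names are illustrative assumptions.

```python
def expected_multiplier(report, belief, alpha):
    """E[1 - alpha * (report - p)^2] with p ~ Bernoulli(belief)."""
    expected_loss = belief * (report - 1.0) ** 2 + (1.0 - belief) * report ** 2
    return 1.0 - alpha * expected_loss


def decomposed_multiplier(report, belief, alpha):
    """Same quantity, split into a report-dependent and a constant part."""
    return 1.0 - alpha * (report - belief) ** 2 - alpha * (belief - belief ** 2)


belief, alpha = 0.35, 0.2  # illustrative values only
grid = [k / 100 for k in range(101)]
best = max(grid, key=lambda r: expected_multiplier(r, belief, alpha))
max_gap = max(
    abs(expected_multiplier(r, belief, alpha) - decomposed_multiplier(r, belief, alpha))
    for r in grid
)
print(best, max_gap)
```

On this assumed grid the two expressions coincide up to rounding and the maximizing report equals the belief, which is the truthfulness property the proposition states.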

## Appendix E Proof of Lemma [4](https://arxiv.org/html/2605.24052#Thmlemma4)

First, we prove that the EXP3 scheme is not truthful. We aim to show that for any number of workers $N\geq 2$, there exists a time horizon $T\geq 2$, a belief $q\in(0,1)\setminus\{\frac{1}{2}\}$ and a strategy profile of the other workers such that worker $i$ achieves a strictly higher expected long-term utility by misreporting in the first time slot.

Consider the case where $m_{t}=1$ for all $t\in[T]$. Let worker $i$ have a private belief $q=\mathcal{P}_{i}^{1}=\Pr(p^{1}=1)$, with $q\notin\{0,\frac{1}{2},1\}$. The worker's long-term utility is defined as the sum of its weights:

$$u_{i}=\sum_{t=1}^{T}w_{i}^{t}.$$

The EXP3 weight update rule for a selected worker $I_{t}$ is:

$$w_{i}^{t+1}=\begin{cases}w_{i}^{t}\cdot\exp\left(-\eta\cdot\tilde{\ell}_{i}^{t}\right),&\text{if }i=I_{t},\\ w_{i}^{t},&\text{otherwise},\end{cases}$$

with the unbiased loss estimator $\tilde{\ell}_{i}^{t}$ given by:

$$\tilde{\ell}_{i}^{t}=\begin{cases}\frac{\hat{\ell}_{i}^{t}}{(1-\beta)\theta_{i}^{t}+\beta/N},&\text{if }i=I_{t},\\ 0,&\text{otherwise},\end{cases}$$

where $\hat{\ell}_{i}^{t}=(\hat{\mathcal{P}}_{i}^{t}-p^{t})^{2}$ and $\theta_{i}^{t}=\frac{w_{i}^{t}}{\sum_{i^{\prime}=1}^{N}w_{i^{\prime}}^{t}}$.

Due to the multiplicative structure of the update, the weight at any time $t\geq 2$ depends on the sequence of selections and reports. The utility can be expressed as:

$$u_{i}=1+\sum_{t=2}^{T}w_{i}^{t}.$$
Now, consider two reporting strategies for worker $i$ in the first time slot $t=1$:

- *Truthful strategy $S_{T}$*: Report $\hat{\mathcal{P}}_{i}^{1}=q$.
- *Deviating strategy $S_{D}$*: Report $\hat{\mathcal{P}}_{i}^{1}=r^{*}\neq q$, where $r^{*}$ is chosen to maximize the expected weight in the subsequent slots.

To prove that the EXP3 scheme is not truthful, it is sufficient to exhibit a counter-example for some finite horizon $T\geq 2$. In particular, it is enough to take $T=2$. In this case, the worker's long-term utility simplifies to

$$u_{i}=w_{i}^{1}+w_{i}^{2}=1+w_{i}^{2},$$

since $w_{i}^{1}=1$ for all $i$. Therefore, for $T=2$, maximizing the expected long-term utility $\mathbb{E}[u_{i}\mid S]$ is equivalent to maximizing the expected weight after the first slot, $\mathbb{E}[w_{i}^{2}\mid S]$, for $S\in\{S_{T},S_{D}\}$.

To show that deviation is profitable, we analyze $\mathbb{E}[w_{i}^{2}\mid S]$. Note that $w_{i}^{2}$ only changes if worker $i$ is selected in $t=1$ (i.e., $I_{1}=i$). Let $\theta_{i}^{1}=\frac{1}{N}$ be the initial selection probability.

If $I_{1}=i$, then:

$$\begin{aligned}
w_{i}^{2}&=w_{i}^{1}\cdot\exp\left(-\eta\cdot\tilde{\ell}_{i}^{1}\right)\\
&=\exp\left(-\eta\cdot\frac{\hat{\ell}_{i}^{1}}{(1-\beta)\theta_{i}^{1}+\beta/N}\right)\\
&=\exp\left(-\eta\cdot\frac{\hat{\ell}_{i}^{1}}{(1-\beta)\cdot\frac{1}{N}+\frac{\beta}{N}}\right)\\
&=\exp\left(-\eta N\cdot\hat{\ell}_{i}^{1}\right).
\end{aligned}$$

If $I_{1}\neq i$, then $w_{i}^{2}=w_{i}^{1}=1$.

Therefore, the expected weight after the first slot is:

$$\mathbb{E}[w_{i}^{2}\mid S]=\theta_{i}^{1}\cdot\mathbb{E}\!\left[\exp\left(-\eta N\cdot\hat{\ell}_{i}^{1}\right)\mid S\right]+(1-\theta_{i}^{1})\cdot 1,$$

i.e.,

$$\mathbb{E}[w_{i}^{2}\mid S]=\frac{1}{N}\cdot\mathbb{E}\!\left[\exp\left(-\eta N\cdot\hat{\ell}_{i}^{1}\right)\mid S\right]+\frac{N-1}{N}.$$

Define

$$F(r)=\mathbb{E}\!\left[\exp\left(-\eta N\cdot\hat{\ell}_{i}^{1}\right)\,\middle|\,\hat{\mathcal{P}}_{i}^{1}=r\right].$$

We have:

$$F(r)=q\cdot\exp\left(-\eta N(r-1)^{2}\right)+(1-q)\cdot\exp\left(-\eta Nr^{2}\right).$$

Its derivative is:

$$F^{\prime}(r)=-2\eta N\left[q(r-1)\exp\left(-\eta N(r-1)^{2}\right)+(1-q)r\exp\left(-\eta Nr^{2}\right)\right].$$

Evaluating at the truthful report $r=q$:

$$F^{\prime}(q)=-2\eta N\cdot q(1-q)\left[\exp\left(-\eta Nq^{2}\right)-\exp\left(-\eta N(1-q)^{2}\right)\right].$$

For $q>\frac{1}{2}$, we have $(1-q)^{2}<q^{2}$ and hence $\exp\left(-\eta Nq^{2}\right)<\exp\left(-\eta N(1-q)^{2}\right)$, so $F^{\prime}(q)>0$. For $q<\frac{1}{2}$, we have $(1-q)^{2}>q^{2}$ and hence $\exp\left(-\eta Nq^{2}\right)>\exp\left(-\eta N(1-q)^{2}\right)$, so $F^{\prime}(q)<0$.

Thus, for $q\neq\frac{1}{2}$, the truthful report $r=q$ is not a local maximum of $F(r)$. There exists some $r^{*}\neq q$ such that $F(r^{*})>F(q)$. Consequently,

$$\mathbb{E}[w_{i}^{2}\mid S_{D}]>\mathbb{E}[w_{i}^{2}\mid S_{T}].$$

For $T=2$, the expected utility is

$$\mathbb{E}[u_{i}\mid S]=1+\mathbb{E}[w_{i}^{2}\mid S],$$

so a higher expected $w_{i}^{2}$ directly implies a higher expected long-term utility. Therefore,

$$\mathbb{E}[u_{i}\mid S_{D}]>\mathbb{E}[u_{i}\mid S_{T}],$$

which proves that the EXP3 scheme is not truthful.
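The first-slot expectation can be reproduced by enumerating the two sources of randomness, the ground truth $p^{1}$ and the selection event $I_{1}=i$. This is a sketch under assumed values ($q=0.7$, $\eta=0.5$, $N=4$); it relies on the fact shown above that the estimator denominator equals $1/N$ in the first slot for any $\beta$. The function names are hypothetical.

```python
import math


def F(r, q, eta, n_workers):
    """Closed form of E[exp(-eta * N * loss)] for report r under belief q."""
    scale = eta * n_workers
    return q * math.exp(-scale * (r - 1.0) ** 2) + (1.0 - q) * math.exp(-scale * r ** 2)


def expected_second_weight(r, q, eta, n_workers):
    """E[w_i^2] by enumerating p^1 in {0, 1} and whether worker i is selected."""
    theta = 1.0 / n_workers  # uniform initial selection probability
    total = 0.0
    for p, prob_p in ((1.0, q), (0.0, 1.0 - q)):
        loss = (r - p) ** 2
        weight_if_selected = math.exp(-eta * loss / theta)  # loss / theta = N * loss
        weight_if_not_selected = 1.0
        total += prob_p * (theta * weight_if_selected + (1.0 - theta) * weight_if_not_selected)
    return total


q, eta, n_workers = 0.7, 0.5, 4  # illustrative values only
truthful = expected_second_weight(q, q, eta, n_workers)
deviating = expected_second_weight(0.75, q, eta, n_workers)
print(truthful, deviating)
```

For these assumed values, shading the report from 0.7 to 0.75 already raises the expected weight, consistent with $F^{\prime}(q)>0$ when $q>\tfrac{1}{2}$.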

## Appendix F Proof of Proposition [4](https://arxiv.org/html/2605.24052#Thmproposition4)

Since $g_{i}(\cdot)$ is strictly increasing, maximizing the utility $u_{i}=g_{i}\!\left(\mathbb{E}\!\left[\sum_{t=1}^{T}w_{i}^{t}\right]\right)$ is equivalent to maximizing the expected cumulative weight $\mathbb{E}\!\left[\sum_{t=1}^{T}w_{i}^{t}\right]$. According to our system model in Section [VI](https://arxiv.org/html/2605.24052#S6), each worker believes that $p_{j}^{t}\sim\mathrm{Bernoulli}(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t}))$ and $\Pr(I_{t}=i)=\theta_{i}^{t}$, so the expectation of $w_{i}^{t+1}$ in ([14](https://arxiv.org/html/2605.24052#S6.E14)) over $p_{j}^{t}$ and $I_{t}$ is

$$\begin{aligned}
\mathbb{E}[w_{i}^{t+1}]
&=(1-\beta)\gamma_{i}^{t}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg[1-\alpha\bigg(1-\frac{\alpha}{\theta_{i}^{t}}\bigg)\bigg(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-1\big)^{2}\\
&\qquad\qquad+\big(1-\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big)\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-0\big)^{2}\bigg)\bigg]+\beta\\
&=(1-\beta)\gamma_{i}^{t}\frac{1}{m_{t}}\sum_{j=1}^{m_{t}}\bigg[1-\alpha\bigg(1-\frac{\alpha}{\theta_{i}^{t}}\bigg)\bigg(\big(\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big)^{2}\\
&\qquad\qquad+\big(\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})-\mathcal{P}_{i}^{2}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})\big)\bigg)\bigg]+\beta,
\end{aligned}$$

which is maximized at $\hat{\mathcal{P}}_{i}^{*}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})=\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})$ if $\alpha<\theta_{i}^{t}$, since the coefficient $\alpha(1-\alpha/\theta_{i}^{t})$ of the squared deviation is then positive. According to the choices of $\alpha$ and $\beta$ in Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2), we have $\theta_{i}^{t}\geq\frac{\beta}{N}=2\alpha>\alpha$, implying $\hat{\mathcal{P}}_{i}^{*}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})=\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})$. To obtain the largest possible cumulative weight, each worker will truthfully report his preference in the first time slot and in all following time slots, because any deviation leads to smaller weights in the next and all following time slots. This completes the proof.
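The role of the condition $\alpha<\theta_{i}^{t}$ can be seen by evaluating the report-dependent factor directly. The sketch below considers one item and drops the prefactor $(1-\beta)\gamma_{i}^{t}$ and the additive $\beta$, which is harmless as long as that prefactor is positive (an assumption made here). It uses the selected-worker estimate $\ell(1-\alpha/\theta)/\theta$ implied by the expectation above; the numeric values and function names are illustrative.

```python
def expected_factor(report, belief, alpha, theta):
    """E[1 - alpha * loss_tilde] over p ~ Bernoulli(belief) and selection.

    loss_tilde = loss * (1 - alpha / theta) / theta if the worker is
    selected (probability theta), and 0 otherwise.
    """
    total = 0.0
    for p, prob_p in ((1.0, belief), (0.0, 1.0 - belief)):
        loss = (report - p) ** 2
        loss_tilde = loss * (1.0 - alpha / theta) / theta
        factor_if_selected = 1.0 - alpha * loss_tilde
        total += prob_p * (theta * factor_if_selected + (1.0 - theta) * 1.0)
    return total


def best_report(belief, alpha, theta, grid=101):
    """Grid search over reports in [0, 1] for the largest expected factor."""
    candidates = [k / (grid - 1) for k in range(grid)]
    return max(candidates, key=lambda r: expected_factor(r, belief, alpha, theta))


belief = 0.35  # illustrative belief
print(best_report(belief, alpha=0.05, theta=0.2))  # alpha < theta
print(best_report(belief, alpha=0.30, theta=0.2))  # alpha > theta
```

With $\alpha<\theta$ the grid maximizer coincides with the belief, while with $\alpha>\theta$ the coefficient changes sign and the maximizer moves to an extreme report, which is why the proof needs $\theta_{i}^{t}\geq 2\alpha$.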

## Appendix G Proof of Theorem [2](https://arxiv.org/html/2605.24052#Thmtheorem2)

According to Proposition [4](https://arxiv.org/html/2605.24052#Thmproposition4), we have $\hat{\mathcal{P}}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})=\mathcal{P}_{i}(y_{l_{j}}^{t}\succ y_{l_{j}^{\prime}}^{t}|x_{j}^{t})$ for all $j\in[m_{t}]$, $i\in[N]$ and $t\in[T]$. To derive a lower bound on $\ln\frac{\sum_{i=1}^{N}\gamma_{i}^{T+1}}{\sum_{i=1}^{N}\gamma_{i}^{1}}$, we assume $\alpha N\leq\frac{\beta}{2}$ and have

$$\begin{aligned}
\ln\frac{\sum_{i=1}^{N}\gamma_{i}^{T+1}}{\sum_{i=1}^{N}\gamma_{i}^{1}}
&=\ln\bigg(\sum_{i=1}^{N}\gamma_{i}^{T+1}\bigg)-\ln\bigg(\sum_{i=1}^{N}\gamma_{i}^{1}\bigg)\\
&=\ln\bigg(\sum_{i=1}^{N}\prod_{t=1}^{T}(1-\alpha\tilde{\ell}_{i}^{t})\bigg)-\ln N\\
&\geq\ln\bigg(\prod_{t=1}^{T}(1-\alpha\tilde{\ell}_{i^{*}}^{t})\bigg)-\ln N\\
&=\sum_{t=1}^{T}\ln\bigg(1-\alpha\tilde{\ell}_{i^{*}}^{t}\bigg)-\ln N\\
&\geq-\alpha\sum_{t=1}^{T}\tilde{\ell}_{i^{*}}^{t}-\alpha^{2}\sum_{t=1}^{T}\bigg(\tilde{\ell}_{i^{*}}^{t}\bigg)^{2}-\ln N,
\end{aligned}\tag{16}$$

where we choose

$$\tilde{\ell}_{i}^{t}=\begin{cases}\frac{\ell_{i}^{t}(1-\alpha/\theta_{i}^{t})}{\theta_{i}^{t}},&\text{if}\ i=I_{t},\\ 0,&\text{otherwise},\end{cases}$$

so that

$$\alpha\tilde{\ell}_{i}^{t}\leq\alpha\frac{\ell_{i}^{t}(1-\alpha/\theta_{i}^{t})}{\theta_{i}^{t}}\leq\alpha\frac{1}{\theta_{i}^{t}}\leq\alpha\frac{N}{\beta}\leq\frac{1}{2},$$

and denote $i^{*}$ as the best worker in hindsight. Note that $\theta_{i}^{t}\geq\frac{\beta}{N}$ is equivalent to $(\gamma_{i}^{t}+\beta)N\geq\beta\sum_{i=1}^{N}\gamma_{i}^{t}$, which holds due to $\gamma_{i}^{t}+\beta>\beta>0$ and $N>\sum_{i=1}^{N}\gamma_{i}^{t}$. The first inequality holds due to $\alpha\tilde{\ell}_{i}^{t}\leq\frac{1}{2}$ for all $i\in[N]$ and $t\in[T]$, so every term of the sum over $i$ is positive and keeping only the term for $i^{*}$ can only decrease it. The second inequality holds due to $\ln(1-x)\geq-x-x^{2}$ for $x\leq\frac{1}{2}$.

To derive an upper bound on $\ln\frac{\sum_{i=1}^{N}\gamma_{i}^{t+1}}{\sum_{i=1}^{N}\gamma_{i}^{t}}$, we have

$$\begin{aligned}
\ln\frac{\sum_{i=1}^{N}\gamma_{i}^{t+1}}{\sum_{i=1}^{N}\gamma_{i}^{t}}
&=\ln\bigg(\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot(1-\alpha\tilde{\ell}_{i}^{t})}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}\bigg)\\
&=\ln\bigg(1-\alpha\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}\bigg)\\
&\leq-\alpha\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}+\frac{1}{2}\alpha^{2}\bigg(\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}\bigg)^{2}\\
&\leq-\alpha\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}+\frac{1}{2}\alpha^{2}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot(\tilde{\ell}_{i}^{t})^{2}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}},
\end{aligned}\tag{17}$$

where the first inequality holds due to $\ln(1-\alpha x)\leq-\alpha x+\frac{1}{2}\alpha^{2}x^{2}$ for $x=\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}$ and $\alpha x\leq 1/2$. The second inequality holds due to Jensen's inequality.
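The two elementary facts used in (16) and (17), the logarithm bound $\ln(1-x)\geq-x-x^{2}$ on $[0,\tfrac{1}{2}]$ and the Jensen step for a weighted average, can be spot-checked numerically. This is only a sanity sketch on an assumed grid and an arbitrary example vector, not a proof; both function names are hypothetical.

```python
import math


def log_lower_bound_holds(x):
    """Check ln(1 - x) >= -x - x^2 at a single point x < 1."""
    return math.log(1.0 - x) >= -x - x * x


def jensen_gap(weights, values):
    """Weighted mean of squares minus the square of the weighted mean."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    mean_of_squares = sum(w * v * v for w, v in zip(weights, values)) / total
    return mean_of_squares - mean ** 2


# The bound holds on a grid over [0, 1/2] ...
holds_on_grid = all(log_lower_bound_holds(k / 1000) for k in range(501))
# ... but can fail for larger x, which is why alpha * loss_tilde <= 1/2 is needed.
fails_at_large_x = not log_lower_bound_holds(0.9)

# Arbitrary example weights and loss estimates (illustrative only).
gap = jensen_gap([0.2, 0.5, 0.3], [0.0, 2.5, 1.0])
print(holds_on_grid, fails_at_large_x, gap)
```

The nonnegative Jensen gap is exactly what allows the squared weighted average in (17) to be replaced by the weighted average of squares.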
Summing ([17](https://arxiv.org/html/2605.24052#A7.E17)) over $t\in[T]$ and telescoping, we have

$$\begin{aligned}
\ln\frac{\sum_{i=1}^{N}\gamma_{i}^{T+1}}{\sum_{i=1}^{N}\gamma_{i}^{1}}
&=\ln\bigg(\frac{\sum_{i=1}^{N}\gamma_{i}^{T+1}}{\sum_{i=1}^{N}\gamma_{i}^{T}}\cdot\frac{\sum_{i=1}^{N}\gamma_{i}^{T}}{\sum_{i=1}^{N}\gamma_{i}^{T-1}}\cdot\cdots\cdot\frac{\sum_{i=1}^{N}\gamma_{i}^{2}}{\sum_{i=1}^{N}\gamma_{i}^{1}}\bigg)\\
&=\sum_{t=1}^{T}\ln\frac{\sum_{i=1}^{N}\gamma_{i}^{t+1}}{\sum_{i=1}^{N}\gamma_{i}^{t}}\\
&\leq-\alpha\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}+\frac{1}{2}\alpha^{2}\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot(\tilde{\ell}_{i}^{t})^{2}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}.
\end{aligned}\tag{18}$$

According to ([16](https://arxiv.org/html/2605.24052#A7.E16)) and ([18](https://arxiv.org/html/2605.24052#A7.E18)), we have

$$\begin{aligned}
&-\alpha\sum_{t=1}^{T}\tilde{\ell}_{i^{*}}^{t}-\alpha^{2}\sum_{t=1}^{T}\bigg(\tilde{\ell}_{i^{*}}^{t}\bigg)^{2}-\ln N\\
&\leq-\alpha\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}+\frac{1}{2}\alpha^{2}\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot(\tilde{\ell}_{i}^{t})^{2}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}.
\end{aligned}$$

After rearranging the above inequality and dividing both sides by $\alpha$, we have

$$\begin{aligned}
&\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}-\sum_{t=1}^{T}\tilde{\ell}_{i^{*}}^{t}\\
&\leq\frac{\ln N}{\alpha}+\frac{1}{2}\alpha\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot(\tilde{\ell}_{i}^{t})^{2}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}+\alpha\sum_{t=1}^{T}\bigg(\tilde{\ell}_{i^{*}}^{t}\bigg)^{2}.
\end{aligned}$$

Taking the expectation over $\tilde{\ell}_{i}^{t}$ in the above inequality, we have

$$\begin{aligned}
&\mathbb{E}\bigg[\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}-\sum_{t=1}^{T}\tilde{\ell}_{i^{*}}^{t}\bigg]\\
&\leq\frac{\ln N}{\alpha}+\frac{1}{2}\alpha\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}\frac{(\ell_{i}^{t})^{2}}{\theta_{i}^{t}}+\alpha\sum_{t=1}^{T}\frac{(\ell_{i^{*}}^{t})^{2}}{\theta_{i}^{t}}\\
&\leq\frac{\ln N}{\alpha}+\frac{1}{2}\alpha NT+\alpha\sum_{t=1}^{T}\frac{(\ell_{i^{*}}^{t})^{2}}{\theta_{i}^{t}},
\end{aligned}\tag{19}$$

where the second inequality holds due to $\ell_{i}^{t}\in[0,1]$ for $i\in[N]$, $\theta_{i}^{t}\geq\min\{\frac{\gamma_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}},\frac{1}{N}\}$ and $\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}\frac{1}{\theta_{i}^{t}}\leq\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}}\frac{1}{\min\{\frac{\gamma_{i}^{t}}{\sum_{i^{\prime}=1}^{N}\gamma_{i^{\prime}}^{t}},\frac{1}{N}\}}=NT$.
We then derive a lower bound on the expectation of $\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}-\sum_{t=1}^{T}\tilde{\ell}_{i^{*}}^{t}$. After taking the expectation of $\hat{\ell}_{i}^{t}$, we have

$$
\begin{aligned}
\mathbb{E}\bigg[\sum_{t=1}^{T}\frac{\sum_{i=1}^{N}\gamma_{i}^{t}\cdot\tilde{\ell}_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}-\sum_{t=1}^{T}\tilde{\ell}_{i^{*}}^{t}\bigg]
&=\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\ell_{i}^{t}\bigg(1-\frac{\alpha}{\theta_{i}^{t}}\bigg)-\sum_{t=1}^{T}\ell_{i^{*}}^{t}\bigg(1-\frac{\alpha}{\theta_{i}^{t}}\bigg)\\
&=\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}-\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\frac{\alpha\ell_{i}^{t}}{\theta_{i}^{t}}+\sum_{t=1}^{T}\ell_{i^{*}}^{t}\frac{\alpha}{\theta_{i}^{t}}\\
&\geq\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}-\alpha NT+\alpha\sum_{t=1}^{T}\frac{(\ell_{i^{*}}^{t})^{2}}{\theta_{i}^{t}},
\end{aligned}
\tag{20}
$$

where the last inequality holds due to $\ell_{i}^{t}\in[0,1]$ for $i\in[N]$, $\theta_{i}^{t}\geq\min\Big\{\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}},\frac{1}{N}\Big\}$, and

$$
\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\frac{1}{\theta_{i}^{t}}
\leq\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\frac{1}{\min\Big\{\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}},\frac{1}{N}\Big\}}=NT.
$$

Combining the upper bound [(19)](https://arxiv.org/html/2605.24052#A7.Ex132) with the lower bound [(20)](https://arxiv.org/html/2605.24052#A7.Ex134), the terms $\alpha\sum_{t=1}^{T}\frac{(\ell_{i^{*}}^{t})^{2}}{\theta_{i}^{t}}$ cancel and we have

$$
\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}\leq\frac{3}{2}\alpha NT+\frac{\ln N}{\alpha}.
$$

Since $\theta_{i}^{t}=\frac{(1-\beta)\gamma_{i}^{t}+\beta}{(1-\beta)\sum_{i=1}^{N}\gamma_{i}^{t}+\beta N}<\frac{\gamma_{i}^{t}}{\sum_{i=1}^{N}\gamma_{i}^{t}}+\frac{1}{N}$, we further have

$$
\begin{aligned}
\sum_{t=1}^{T}\sum_{i=1}^{N}\bigg(\theta_{i}^{t}-\frac{1}{N}\bigg)\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}
&<\sum_{t=1}^{T}\sum_{i=1}^{N}\frac{\gamma_{i}^{t}}{\sum_{i'=1}^{N}\gamma_{i'}^{t}}\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}\\
&\leq 3\alpha NT+\frac{\ln N}{\alpha},
\end{aligned}
$$

which is equivalent to

$$
\begin{aligned}
\sum_{t=1}^{T}\sum_{i=1}^{N}\theta_{i}^{t}\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}
&\leq\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{N}\ell_{i}^{t}+3\alpha NT+\frac{\ln N}{\alpha}\\
&\leq\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{2\beta}{N}\ell_{i}^{t}+3\alpha NT+\frac{\ln N}{\alpha}\\
&\leq 2\beta T+3\alpha NT+\frac{\ln N}{\alpha},
\end{aligned}
$$

where the second inequality holds due to $\beta\leq\frac{1}{2}$ and the third due to $\ell_{i}^{t}\leq 1$. By taking $\beta=2\alpha N$, we have

$$
\begin{aligned}
R_{\mathcal{M}}(T)&=\sum_{t=1}^{T}\sum_{i=1}^{N}\theta_{i}^{t}\ell_{i}^{t}-\sum_{t=1}^{T}\ell_{i^{*}}^{t}\leq 7\alpha NT+\frac{\ln N}{\alpha}\\
&\leq 2\sqrt{7}\sqrt{NT\ln N}=\mathcal{O}(\sqrt{T})
\end{aligned}
$$

at $\alpha=\sqrt{\frac{\ln N}{7NT}}$. Now let us check the condition $\beta\leq\frac{1}{2}$, which is equivalent to $T>\frac{4}{\sqrt{7}}N\ln N$ and holds for $T\to\infty$. Note that $\beta=2\alpha N$ satisfies the condition $\alpha\frac{N}{\beta}\leq\frac{1}{2}$. We then finish the proof.
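As an illustrative sanity check of the last step (not part of the original proof), the sketch below evaluates the bound $7\alpha NT+\frac{\ln N}{\alpha}$ at the stated $\alpha$ and at a few perturbed values, and compares it with the closed form $2\sqrt{7}\sqrt{NT\ln N}$. The function names and the values of `N` and `T` are assumptions chosen only for illustration.

```python
import math


def regret_bound(alpha: float, n: int, t: int) -> float:
    """Right-hand side 7*alpha*N*T + ln(N)/alpha from the final display."""
    return 7.0 * alpha * n * t + math.log(n) / alpha


def stated_alpha(n: int, t: int) -> float:
    """Step size alpha = sqrt(ln N / (7 N T)) stated in the proof."""
    return math.sqrt(math.log(n) / (7.0 * n * t))


# Illustrative values (assumptions, not taken from the paper's experiments).
N, T = 10, 100_000
alpha_star = stated_alpha(N, T)
bound_at_star = regret_bound(alpha_star, N, T)
closed_form = 2.0 * math.sqrt(7.0) * math.sqrt(N * T * math.log(N))

# Scaling alpha up or down should never give a smaller bound.
perturbed = [regret_bound(alpha_star * s, N, T) for s in (0.5, 0.9, 1.1, 2.0)]

# beta = 2 * alpha * N, as chosen in the proof.
beta = 2.0 * alpha_star * N
print(alpha_star, bound_at_star, closed_form, beta)
```

For a fixed $N$, quadrupling `T` in this sketch doubles the bound, which is the $\mathcal{O}(\sqrt{T})$ behaviour claimed above.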

## Appendix H Proof of Proposition [2](https://arxiv.org/html/2605.24052#Thmproposition2)

By Proposition [1](https://arxiv.org/html/2605.24052#Thmproposition1), workers report truthfully, so $\hat{\mathcal{P}}_{i}=\mathcal{P}_{i}$ and $L_{i}^{s}:=\frac{1}{m_{s}}\sum_{j=1}^{m_{s}}\big(\mathcal{P}_{i}(y_{l_{j}}^{s}\succ y_{l_{j}^{\prime}}^{s}\mid x_{j}^{s})-p_{j}^{s}\big)^{2}\in[0,1]$ is worker $i$'s realized per-slot squared feedback loss, with expectation $\ell_{i}^{s}=\mathbb{E}[L_{i}^{s}]$.

Step 1: Multiplicative form of the expected weights. By ([6](https://arxiv.org/html/2605.24052#S5.E6)), $w_{i}^{s+1}=w_{i}^{s}(1-\alpha L_{i}^{s})$. Since $w_{i}^{s}$ is determined by $\{p_{j}^{\sigma}:\sigma<s\}$ and $L_{i}^{s}$ is determined by $\{p_{j}^{s}\}$, and since the realized states are independent across slots, $w_{i}^{s}$ and $L_{i}^{s}$ are independent. Taking expectations conditional on the weight at slot $t_{0}$ yields $\mathbb{E}[w_{i}^{s+1}]=\mathbb{E}[w_{i}^{s}]\cdot(1-\alpha\ell_{i}^{s})$. Iterating from $s=t_{0}$ for $\tau$ steps gives, for both workers,

$$
\mathbb{E}[w_{i}^{t_{0}+\tau}]=w_{i}^{t_{0}}\prod_{s=t_{0}}^{t_{0}+\tau-1}(1-\alpha\ell_{i}^{s}),\quad\mathbb{E}[w_{k}^{t_{0}+\tau}]=w_{k}^{t_{0}}\prod_{s=t_{0}}^{t_{0}+\tau-1}(1-\alpha\ell_{k}^{s}).
$$
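The product form in Step 1 can be checked exactly on a small example. The sketch below is illustrative only: it assumes, as a simplification not made in the paper, that each realized loss $L_{i}^{s}$ is a Bernoulli variable with mean $\ell_{i}^{s}$, enumerates every loss sequence, and compares the exact expectation with $w_{i}^{t_{0}}\prod_{s}(1-\alpha\ell_{i}^{s})$. All function names and numeric values are hypothetical.

```python
import itertools
import math


def expected_weight_exact(w0, alpha, mean_losses):
    """E[w0 * prod_s (1 - alpha * L_s)] by enumerating independent
    Bernoulli losses L_s in {0, 1} with P(L_s = 1) = mean_losses[s]."""
    total = 0.0
    for outcome in itertools.product((0, 1), repeat=len(mean_losses)):
        prob = math.prod(
            m if x == 1 else 1.0 - m for x, m in zip(outcome, mean_losses)
        )
        weight = w0 * math.prod(1.0 - alpha * x for x in outcome)
        total += prob * weight
    return total


def expected_weight_product(w0, alpha, mean_losses):
    """Closed form w0 * prod_s (1 - alpha * l_s) from Step 1."""
    return w0 * math.prod(1.0 - alpha * m for m in mean_losses)


# Hypothetical starting weight, step size, and expected per-slot losses.
W0, ALPHA = 0.8, 0.1
MEAN_LOSSES = [0.2, 0.5, 0.7, 0.1]
exact = expected_weight_exact(W0, ALPHA, MEAN_LOSSES)
closed = expected_weight_product(W0, ALPHA, MEAN_LOSSES)
print(exact, closed)
```

The two values agree up to floating-point error because independence across slots lets the expectation of the product factor into the product of expectations.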
Step 2: Per-slot growth lower bound. For any $s\geq t_{0}$,

$$
\ln\frac{1-\alpha\ell_{i}^{s}}{1-\alpha\ell_{k}^{s}}=\int_{\ell_{i}^{s}}^{\ell_{k}^{s}}\frac{\alpha}{1-\alpha u}\,du\geq\alpha(\ell_{k}^{s}-\ell_{i}^{s})\geq\alpha\Delta,
$$

where the first inequality holds since $\frac{\alpha}{1-\alpha u}\geq\alpha$ for $u\in[0,1]$ under $\alpha<1/2$, and the second follows from $\Delta:=\min_{s\geq t_{0}}(\ell_{k}^{s}-\ell_{i}^{s})$.

Step 3: Catch-up time. Catching up after $\tau_{\textnormal{new}}$ updates means $\mathbb{E}[w_{i}^{t_{0}+\tau_{\textnormal{new}}}]\geq\mathbb{E}[w_{k}^{t_{0}+\tau_{\textnormal{new}}}]$, equivalently,

$$
\sum_{s=t_{0}}^{t_{0}+\tau_{\textnormal{new}}-1}\ln\bigg(\frac{1-\alpha\ell_{i}^{s}}{1-\alpha\ell_{k}^{s}}\bigg)\geq\ln\bigg(\frac{w_{k}^{t_{0}}}{w_{i}^{t_{0}}}\bigg).
$$

The sum has $\tau_{\textnormal{new}}$ terms, each lower-bounded by $\alpha\Delta$ from Step 2. Hence the catch-up condition is satisfied whenever $\tau_{\textnormal{new}}\alpha\Delta\geq\ln(w_{k}^{t_{0}}/w_{i}^{t_{0}})$, giving the upper bound

$$
\tau_{\textnormal{new}}\leq\bigg\lceil\frac{\ln(w_{k}^{t_{0}}/w_{i}^{t_{0}})}{\alpha\Delta}\bigg\rceil=\mathcal{O}\bigg(\frac{1}{\alpha\Delta}\bigg)=\mathcal{O}\bigg(\sqrt{\frac{T}{\ln N}}\cdot\Delta^{-1}\bigg)
$$

after substituting $\alpha=\frac{2}{3}\sqrt{2\ln N/T}$ from Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1). The bound depends on the weight ratio $w_{k}^{t_{0}}/w_{i}^{t_{0}}$ at the arrival slot but not on $t_{0}$ itself, so it does not grow with the existing worker's tenure prior to $t_{0}$. We finish the proof. $\square$
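To make the catch-up bound concrete, the following sketch iterates the expected-weight recursion from Step 1 for two workers and compares the first slot at which worker $i$ overtakes worker $k$ with the ceiling bound from Step 3. It is a minimal illustration under assumed values: constant expected losses, the initial weights, and `alpha` are all hypothetical, and the function names are not from the paper.

```python
import math


def catch_up_slots(w_i, w_k, alpha, losses_i, losses_k):
    """First tau at which worker i's expected weight reaches worker k's,
    using E[w^{s+1}] = E[w^s] * (1 - alpha * l^s); None if never reached."""
    if w_i >= w_k:
        return 0
    for tau, (l_i, l_k) in enumerate(zip(losses_i, losses_k), start=1):
        w_i *= 1.0 - alpha * l_i
        w_k *= 1.0 - alpha * l_k
        if w_i >= w_k:
            return tau
    return None


def catch_up_bound(w_i, w_k, alpha, delta):
    """Upper bound ceil(ln(w_k / w_i) / (alpha * delta)) from Step 3."""
    return math.ceil(math.log(w_k / w_i) / (alpha * delta))


# Hypothetical setting: worker i arrives with a small weight but a lower loss.
ALPHA, HORIZON = 0.05, 1000
W_I, W_K = 0.2, 1.0
LOSS_I, LOSS_K = 0.1, 0.4
DELTA = LOSS_K - LOSS_I  # minimum expected-loss gap over the window

tau = catch_up_slots(W_I, W_K, ALPHA, [LOSS_I] * HORIZON, [LOSS_K] * HORIZON)
bound = catch_up_bound(W_I, W_K, ALPHA, DELTA)
print(tau, bound)
```

Because each per-slot log-ratio is at least $\alpha\Delta$, the simulated catch-up slot can only be at or below the bound; the bound is loose exactly to the extent that $\frac{\alpha}{1-\alpha u}$ exceeds $\alpha$.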

## Appendix I Proof of Proposition [3](https://arxiv.org/html/2605.24052#Thmproposition3)

We prove parts (a) and (b) in turn.

Proof of part (a). Fix worker $i$ and prompt $j$; we suppress the slot and prompt indices in this part for brevity. Under worker $i$'s Bernoulli belief $p_{j}^{t}\sim\text{Bernoulli}(\mathcal{P}_{i})$, by the law of total probability under the symmetric $\epsilon$-flip,

$$
\begin{aligned}
\Pr(\tilde{p}_{j}^{t}=1)&=(1-\epsilon)\cdot\mathcal{P}_{i}+\epsilon\cdot(1-\mathcal{P}_{i})\\
&=(1-2\epsilon)\mathcal{P}_{i}+\epsilon.
\end{aligned}
$$

Hence $\tilde{p}_{j}^{t}\sim\text{Bernoulli}(q_{\epsilon})$ with $q_{\epsilon}:=(1-2\epsilon)\mathcal{P}_{i}+\epsilon$. Intuitively, the symmetric flip contracts the signal toward $1/2$ by a factor of $(1-2\epsilon)$, and recovers $\mathcal{P}_{i}$ as $\epsilon\to 0$.

By the mean-variance decomposition of $\mathbb{E}[(\hat{\mathcal{P}}_{i}-\tilde{p}_{j}^{t})^{2}]$ under $\tilde{p}_{j}^{t}\sim\text{Bernoulli}(q_{\epsilon})$, the expected weight multiplier at slot $t$ under noisy verification is

$$
\mathbb{E}[w_{i}^{t+1}]=w_{i}^{t}\cdot\big[1-\alpha(\hat{\mathcal{P}}_{i}-q_{\epsilon})^{2}-\alpha q_{\epsilon}(1-q_{\epsilon})\big],
$$

which is a strictly concave quadratic in $\hat{\mathcal{P}}_{i}$ uniquely maximized at $\hat{\mathcal{P}}_{i}^{*}=q_{\epsilon}$. The best-response deviation from truthful reporting is therefore

$$
|\hat{\mathcal{P}}_{i}^{*}-\mathcal{P}_{i}|=|q_{\epsilon}-\mathcal{P}_{i}|=\epsilon\cdot|1-2\mathcal{P}_{i}|\leq\epsilon.
$$
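The best-response argument can be illustrated numerically. The sketch below, a minimal example rather than the paper's implementation, computes $q_{\epsilon}$, checks the mean-variance decomposition against the direct Bernoulli expectation, and searches a grid of reports for the one maximizing the expected weight multiplier. The belief `P_TRUE`, the noise level `EPS`, the step size `ALPHA`, and all function names are illustrative assumptions.

```python
def flipped_mean(p_true, eps):
    """q_eps = (1 - 2*eps) * P_i + eps: mean of the label after a symmetric eps-flip."""
    return (1.0 - 2.0 * eps) * p_true + eps


def direct_sq_loss(report, q):
    """E[(report - X)^2] for X ~ Bernoulli(q), computed from the two outcomes."""
    return q * (report - 1.0) ** 2 + (1.0 - q) * report ** 2


def expected_multiplier(report, q, alpha):
    """1 - alpha*(report - q)^2 - alpha*q*(1 - q), the bracketed term above."""
    return 1.0 - alpha * (report - q) ** 2 - alpha * q * (1.0 - q)


def best_report_on_grid(q, alpha, steps=1000):
    """Grid search over reports in [0, 1] for the largest expected multiplier."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda r: expected_multiplier(r, q, alpha))


# Hypothetical belief, noise level, and step size.
P_TRUE, EPS, ALPHA = 0.8, 0.1, 0.05
q_eps = flipped_mean(P_TRUE, EPS)
best = best_report_on_grid(q_eps, ALPHA)
deviation = abs(q_eps - P_TRUE)
print(q_eps, best, deviation)
```

With these values the maximizing report sits at the grid point closest to $q_{\epsilon}$ rather than at $\mathcal{P}_{i}$, and the gap between the two equals $\epsilon|1-2\mathcal{P}_{i}|$, which never exceeds $\epsilon$.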
To bound the cumulative strategic gain, note that the per-slot gain from shifting the report from $\mathcal{P}_{i}$ to $q_{\epsilon}$ is at most $\alpha w_{i}^{t}(q_{\epsilon}-\mathcal{P}_{i})^{2}\leq\alpha w_{i}^{t}\epsilon^{2}$. Using $w_{i}^{t}\leq 1$ from ([6](https://arxiv.org/html/2605.24052#S5.E6)) and summing over $T$ slots,

$$
\sum_{t=1}^{T}\alpha w_{i}^{t}(q_{\epsilon}-\mathcal{P}_{i})^{2}\leq\alpha\epsilon^{2}T=\mathcal{O}(\epsilon^{2}\sqrt{T}),
$$

after substituting $\alpha=\frac{2}{3}\sqrt{2\ln N/T}$ from Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1). This is dominated by the $\mathcal{O}(\sqrt{T})$ regret term in Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1) and vanishes as $\epsilon\to 0$.

Proof of part (b). Fix worker $i$ and prompt $j$. For any reported $\hat{\mathcal{P}}_{i}\in[0,1]$ and any labels $p_{j}^{t},\tilde{p}_{j}^{t}\in\{0,1\}$,

$$
\big|(\hat{\mathcal{P}}_{i}-\tilde{p}_{j}^{t})^{2}-(\hat{\mathcal{P}}_{i}-p_{j}^{t})^{2}\big|\leq\mathbf{1}\{\tilde{p}_{j}^{t}\neq p_{j}^{t}\},
$$

since when the labels differ, $|(\hat{\mathcal{P}}_{i}-\tilde{p}_{j}^{t})^{2}-(\hat{\mathcal{P}}_{i}-p_{j}^{t})^{2}|=|2\hat{\mathcal{P}}_{i}-1|\cdot|\tilde{p}_{j}^{t}-p_{j}^{t}|\leq 1$, and when they agree, the left-hand side is zero. Taking expectation and using $\Pr(\tilde{p}_{j}^{t}\neq p_{j}^{t})\leq\epsilon$,

$$
\big|\mathbb{E}[(\hat{\mathcal{P}}_{i}-\tilde{p}_{j}^{t})^{2}]-\mathbb{E}[(\hat{\mathcal{P}}_{i}-p_{j}^{t})^{2}]\big|\leq\epsilon.
$$
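Both inequalities in part (b) can be checked by direct enumeration. The sketch below is illustrative and assumes the symmetric $\epsilon$-flip of part (a) as the noise model, which is one instance of $\Pr(\tilde{p}_{j}^{t}\neq p_{j}^{t})\leq\epsilon$; the grid resolution, `EPS`, and the function names are hypothetical.

```python
def pointwise_gap(report, label, noisy_label):
    """|(report - noisy_label)^2 - (report - label)^2| for binary labels."""
    return abs((report - noisy_label) ** 2 - (report - label) ** 2)


def expected_sq_loss(report, mean):
    """E[(report - X)^2] for X ~ Bernoulli(mean)."""
    return mean * (report - 1.0) ** 2 + (1.0 - mean) * report ** 2


def expected_gap(report, p_true, eps):
    """Gap between noisy and clean expected losses under a symmetric eps-flip,
    where the noisy label is Bernoulli((1 - 2*eps) * p_true + eps)."""
    q = (1.0 - 2.0 * eps) * p_true + eps
    return abs(expected_sq_loss(report, q) - expected_sq_loss(report, p_true))


grid = [k / 20 for k in range(21)]

# Pointwise bound: the gap is at most the indicator that the labels differ.
pointwise_ok = all(
    pointwise_gap(r, p, p_noisy) <= (1.0 if p != p_noisy else 0.0) + 1e-12
    for r in grid
    for p in (0, 1)
    for p_noisy in (0, 1)
)

# Expected bound: the worst case over reports and beliefs is at most eps.
EPS = 0.1
worst_gap = max(expected_gap(r, p, EPS) for r in grid for p in grid)
print(pointwise_ok, worst_gap)
```

Under this noise model the expected gap equals $\epsilon\,|1-2\mathcal{P}_{i}|\,|1-2\hat{\mathcal{P}}_{i}|$, so the bound $\epsilon$ is attained only when both the belief and the report are at the endpoints of $[0,1]$.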
Averaging over $j=1,\dots,m_{t}$ and summing over $t=1,\dots,T$, the aggregation loss in the first term of ([5](https://arxiv.org/html/2605.24052#S3.E5)) under noisy verification differs from the clean-case aggregation loss by at most $\epsilon T$. The same $\epsilon T$ bound applies to the best-expert benchmark loss in the second term of ([5](https://arxiv.org/html/2605.24052#S3.E5)). By the triangle inequality and the clean-case regret bound in Theorem [1](https://arxiv.org/html/2605.24052#Thmtheorem1),

$$
\mathbb{E}[R_{\mathcal{M}}(T)]\leq 3\sqrt{\frac{T\ln N}{2}}+2\epsilon T=\mathcal{O}(\sqrt{T})+2\epsilon T.
$$

Dividing by $T$,

$$
\frac{\mathbb{E}[R_{\mathcal{M}}(T)]}{T}\leq\mathcal{O}\bigg(\frac{1}{\sqrt{T}}\bigg)+2\epsilon,
$$

recovering the clean-case bound as $\epsilon\to 0$. We finish the proof.
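As a small worked illustration of the final bound, the sketch below evaluates the per-slot quantity $\frac{3}{T}\sqrt{\frac{T\ln N}{2}}+2\epsilon$ over increasing horizons. The values of `N` and `EPS` and the function name are assumptions made only for this example.

```python
import math


def average_regret_bound(t, n, eps):
    """Per-slot bound (3 * sqrt(T * ln N / 2) + 2 * eps * T) / T from part (b)."""
    return 3.0 * math.sqrt(t * math.log(n) / 2.0) / t + 2.0 * eps


# Hypothetical number of workers and verification noise level.
N, EPS = 10, 0.05
horizons = [10 ** k for k in range(2, 9)]
bounds = [average_regret_bound(t, N, EPS) for t in horizons]

for t, b in zip(horizons, bounds):
    print(t, b)
```

The printed values decrease toward the floor $2\epsilon$ as $T$ grows, which reflects that the noise term, unlike the $\mathcal{O}(1/\sqrt{T})$ term, does not vanish with the horizon.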