Towards Multi-Agent-Simulation-Based Community Note Evaluation

arXiv cs.AI Papers

Summary

This paper introduces ComRate, a large-scale dataset of community notes and ratings from X, and proposes MultiCom, a persona-guided multi-agent framework for simulating community note evaluation. The approach achieves 84.7% accuracy in predicting note helpfulness.

arXiv:2606.18268v1 Announce Type: cross

Cached at: 06/18/26, 05:43 AM

# Towards Multi-Agent-Simulation-Based Community Note Evaluation
Source: [https://arxiv.org/html/2606.18268](https://arxiv.org/html/2606.18268)
Changxi Wen1, Shuning Zhang1, Bohao Chu2, Yuwei Chuai3, Hui Wang2, Dai Shi4, Xin Yi1, Hewu Li1

1 Tsinghua University, Beijing, China; 2 University of Duisburg\-Essen, Duisburg, Germany; 3 University of Luxembourg, Luxembourg; 4 Tongji University, Shanghai, China

###### Abstract

Community\-based fact\-checking that relies on cross\-consensus is expanding rapidly on social media platforms\. However, the delay and low ratio of cross\-consensus community fact\-checks rated by human contributors remain a significant challenge\. To address this, we first created ComRate, a large\-scale dataset comprising 2\.5 million community notes and over 209 million ratings sourced from $\mathbb{X}$\. We then propose MultiCom, a persona\-guided multi\-agent rating framework for community note evaluation\. MultiCom simulates a diverse rater population by clustering contributors in a matrix\-factorized rater space and prompting persona agents to generate structured assessments based on the official community notes rating schema\. These agents output structured and explainable judgments, such as confidence, agreement signals, and reasons\. An out\-of\-fold calibrated aggregation algorithm combines features such as raw votes and diagnostic reason signals for reliable prediction\. Extensive evaluations demonstrate that MultiCom outperforms alternative methods, achieving an average accuracy of 84\.7% \(balanced accuracy 68\.3%, macro\-F1 60\.1%\) on the evaluation set\.


## 1 Introduction

Tackling misinformation and disinformation remains a critical priority for social platforms\. While early initiatives relied on professional fact\-checkers Micallef et al\. \([2022](https://arxiv.org/html/2606.18268#bib.bib17)\) or automated fact\-checking systems Guo et al\. \([2022](https://arxiv.org/html/2606.18268#bib.bib18)\), these approaches often face high costs and limited scalability\. Crowdsourced fact\-checking has emerged as a scalable alternative, leveraging collective efforts to author “community notes” – short, evidence\-based contexts designed to debunk misleading posts Pröllochs \([2022](https://arxiv.org/html/2606.18268#bib.bib8)\) and curb the spread of misinformation Chuai et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib9),[2026c](https://arxiv.org/html/2606.18268#bib.bib38)\)\. Such programs have been operational on platforms like $\mathbb{X}$ for over five years, spanning from early 2021 to 2026\.

However, debunking community notes still need human raters to determine their actual helpfulness Chuai et al\. \([2026b](https://arxiv.org/html/2606.18268#bib.bib10)\); Pröllochs \([2022](https://arxiv.org/html/2606.18268#bib.bib8)\)\. This need is not unique to crowdsourced fact\-checking systems: even professional fact\-checkers cross\-check one another’s work, a practice shown to improve the comprehensiveness of fact\-checks Warren et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib3)\); Micallef et al\. \([2022](https://arxiv.org/html/2606.18268#bib.bib17)\)\. In the era of Generative AI, where automated fact\-checking systems are increasingly proposed and adopted Nakov et al\. \([2021](https://arxiv.org/html/2606.18268#bib.bib5)\); Guo et al\. \([2022](https://arxiv.org/html/2606.18268#bib.bib18)\), evaluating the generated fact\-checking materials is important, especially from a user\-centric perspective\.

Existing literature predominantly focuses on automated note generation De et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib13)\); Zhang et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib14)\), scaffolding note\-writing workflows Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\), or general automated fact\-checking architectures Nakov et al\. \([2021](https://arxiv.org/html/2606.18268#bib.bib5)\); Guo et al\. \([2022](https://arxiv.org/html/2606.18268#bib.bib18)\)\. Crucially, the sparse research addressing note evaluation Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\) fails to account for complex evaluation dynamics, such as intermediate rating statuses\.

To address these challenges, we first construct ComRate, a large\-scale real\-world dataset comprising 209,290,533 ratings on 2,566,644 community notes, spanning 1,698,835 posts from Jan 2021 to Apr 2026\. We then propose MultiCom, a persona\-guided multi\-agent rating algorithm designed for debunking note evaluation\. MultiCom leverages matrix factorization to cluster human raters and derives agents that mimic their heterogeneous personas\. These agents perform explainable reasoning across multiple nuanced quality dimensions, such as evidence strength and claim coverage, instead of providing simple binary labels\. Finally, a selective aggregation agent employs cross\-validation and dual\-threshold decision rules to prioritize reliable outcomes and resolve “needs more ratings” cases\.

Extensive evaluations on ComRate demonstrate that MultiCom outperforms alternative methods\. It further generalizes to unseen future notes, and notes with different reasons\. Our contributions are three\-fold:

- Method: We introduce MultiCom, a multi\-agent framework that uses persona\-guided simulation and multi\-dimensional reasoning for explainable note evaluation\.
- Dataset: We provide ComRate, the most comprehensive real\-world rating dataset of community notes and human ratings\.
- Empirical: We demonstrate MultiCom’s effectiveness, generalizability across time and models, and its ability to provide diagnostic feedback for improving fact\-checking quality\.

## 2 ComRate

We constructed the ComRate dataset using data from the $\mathbb{X}$ API and the platform’s official open\-source repository \([https://communitynotes\.x\.com/guide/en/under\-the\-hood/download\-data](https://communitynotes.x.com/guide/en/under-the-hood/download-data)\)\. The resulting dataset comprises 209,290,533 ratings on 2,566,644 community notes attached to 1,698,835 posts\. The data spans a five\-year period from January 28, 2021 to April 5, 2026\.

To provide insights into the fact\-checking ecosystem, we conducted an analysis of the dataset \(Fig\. [1](https://arxiv.org/html/2606.18268#S2.F1); for the detailed methodology, see Appendix [B\.2](https://arxiv.org/html/2606.18268#A2.SS2)\)\. First, the temporal distribution highlights a rapid adoption and scaling of the program, with the volume of notes, posts, and ratings peaking prominently in 2024 \(Fig\. [1](https://arxiv.org/html/2606.18268#S2.F1)\(a\)\)\. Second, Fig\. [1](https://arxiv.org/html/2606.18268#S2.F1)\(b\) provides a heatmap of standardized behavioral profiles across distinct rater clusters, with feature definitions and z\-score normalization detailed in Appendix [B\.2](https://arxiv.org/html/2606.18268#A2.SS2)\. The differences across clusters underscore the necessity of modeling diverse evaluator personas instead of assuming a uniform rater population\. Third, we examined the distribution of misinformation categories and note\-to\-post ratios \(Fig\. [1](https://arxiv.org/html/2606.18268#S2.F1)\(c\)\)\. We found that “Factual error” and “Manipulated media” are the most frequent categories, and that the vast majority of posts have a single note\.

Additional analyses, including language distribution and character\-length statistics, are detailed in Appendix [B\.3](https://arxiv.org/html/2606.18268#A2.SS3)\. Since the official Community Notes release does not provide complete post text for all notes, these additional full\-dataset statistics are computed from note text and official note metadata\.

Our task focuses on predicting whether or not a note is helpful\. This aligns with the criteria of the Community Notes program on $\mathbb{X}$\. We define helpful as whether the note provides important context that helps a person recontextualize the original post\. For a note to be helpful, it should satisfy the following dimensions, as recommended by $\mathbb{X}$: it should be well\-sourced \(relevant and high\-quality citations\), clear \(easily understandable language\), comprehensive \(addressing all key claims\), relevant \(providing crucial context\), and neutral \(free from argumentative, speculative, or biased rhetoric\)\.

Our evaluation classifies each note into one of three statuses, as in $\mathbb{X}$: Helpful \(denoted as H\), Not Helpful \(NH\), and Needs More Ratings \(NMR\)\. We refer to H/NH as resolved statuses, and retain NMR because some notes are contested and cannot be simply classified into binary classes\.

![Refer to caption](https://arxiv.org/html/2606.18268v1/x1.png) \(a\) Temporal growth of community notes records\.
![Refer to caption](https://arxiv.org/html/2606.18268v1/x2.png) \(b\) Rater\-cluster behavioral profiles\.
![Refer to caption](https://arxiv.org/html/2606.18268v1/x3.png) \(c\) Distributions of note categories and notes per post\.

Figure 1: Descriptive analysis of the ComRate dataset\.

## 3 MultiCom

### 3\.1 Algorithm Pipeline

We design the pipeline as a multi\-agent system, emphasizing diversity, diagnostic judgment, and reliable final decisions, as shown in Figure [2](https://arxiv.org/html/2606.18268#S3.F2)\. We adopt a multi\-agent simulation structure for rating, which constructs cluster\-grounded agents from a matrix\-factorization structure, so that the agents reflect the heterogeneity of the rating process\. We ask each agent to produce multi\-dimensional judgments instead of only a binary label, as helpfulness depends on multiple aspects such as stance and evidence quality\. Preserving this information could increase aggregation effectiveness and provide explainable reasoning\. Finally, the aggregation module draws on cluster\-level and agent\-level features to reliably aggregate results\.

![Refer to caption](https://arxiv.org/html/2606.18268v1/Figure/new-framework.png) Figure 2: The algorithm flow of MultiCom\.

### 3\.2 Rater\-Grounded Persona Simulation

To model raters’ personas, we first learn a contributor space with biased rank\-one matrix factorization,

$$r_{ij} \approx \mu + \alpha_i + \beta_j + u_i v_j \tag{1}$$

where $r_{ij}$ denotes the observed rating from contributor $i$ on note $j$, $\mu$ is the global intercept, $\alpha_i$ and $\beta_j$ are contributor\- and note\-specific biases, and $u_i$ and $v_j$ are one\-dimensional latent factors\. We then cluster contributors in this learned rater space to obtain distinct behavioral groups\. For each cluster, we summarize its empirical rating profile, including the cluster size, mean historical helpfulness rating, tendency to rate helpful/not\-helpful, agreement tendency, and reason\-selection patterns when such reason annotations are available\. Each cluster is then converted into a persona prompt that instructs the corresponding agent to evaluate notes according to this cluster\-level rating behavior\. In this way, agents simulate empirically observed rater groups with different agreement tendencies, helpfulness priors, strictness levels, and sensitivities to note\-quality dimensions \(e\.g\., source quality, claim coverage\)\.
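The factorization in Eq. (1) and the subsequent clustering can be sketched in a few lines. The block below is a minimal illustration only, not the authors' implementation: the SGD optimizer, the learning rate, the regularization strength, the choice to fix $\mu$ at the mean rating, the use of $(u_i, \alpha_i)$ as the clustering space, and the helper names `fit_rank_one_mf` and `kmeans` are all assumptions made for the example.

```python
import random


def fit_rank_one_mf(ratings, n_raters, n_notes, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit r_ij ~ mu + alpha_i + beta_j + u_i * v_j with plain SGD.

    ratings: list of (rater index, note index, rating) triples.
    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # intercept fixed at the mean rating
    alpha = [0.0] * n_raters
    beta = [0.0] * n_notes
    u = [rng.uniform(-0.1, 0.1) for _ in range(n_raters)]
    v = [rng.uniform(-0.1, 0.1) for _ in range(n_notes)]
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for i, j, r in data:
            err = r - (mu + alpha[i] + beta[j] + u[i] * v[j])
            alpha[i] += lr * (err - reg * alpha[i])
            beta[j] += lr * (err - reg * beta[j])
            u_old = u[i]  # keep the pre-update value for the v_j step
            u[i] += lr * (err * v[j] - reg * u[i])
            v[j] += lr * (err * u_old - reg * v[j])
    return mu, alpha, beta, u, v


def kmeans(points, k, iters=50):
    """Lloyd's k-means; initial centroids are evenly spaced points after sorting."""
    order = sorted(points)
    centroids = [order[(len(order) - 1) * c // max(k - 1, 1)] for c in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Toy data: raters 0-2 and raters 3-5 hold opposite views on notes 0-2 vs. notes 3-5.
ratings = [(i, j, 1.0 if (i < 3) == (j < 3) else 0.0) for i in range(6) for j in range(6)]
mu, alpha, beta, u, v = fit_rank_one_mf(ratings, n_raters=6, n_notes=6)
labels = kmeans(list(zip(u, alpha)), k=2)  # cluster raters in the learned (u_i, alpha_i) space
```

On this toy data the two opposing rater groups end up in different clusters. Each cluster's rating statistics would then be summarized into a persona prompt, a step that is not shown here.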

### 3\.3 Multi\-dimensional Agent Prediction

Each agent outputs a structured judgment following the Community Notes program schema\. For a post\-note pair $(p,n)$, agent $a$ produces

$$z_a(p,n) = (y_a, \mathbf{s}_a, c_a, \mathbf{q}_a, \mathbf{f}_a, r_a), \tag{2}$$

where $y_a \in \{\texttt{helpful}, \texttt{somewhat helpful}, \texttt{not helpful}\}$ is the agent’s overall helpfulness rating\. The stance vector $\mathbf{s}_a$ contains the agent’s agreement signals, including “agree” and “disagree”\. The confidence signal $c_a$ records how confident the agent is in its judgment\. The quality vector $\mathbf{q}_a$ contains helpfulness reasons used in Community Notes ratings: Clear, GoodSources, AddressesClaim, ImportantContext, and UnbiasedLanguage\. These dimensions capture whether the note is clear, well\-supported, directly addresses the claim, provides important context, and uses neutral language\. The failure vector $\mathbf{f}_a$ contains not\-helpful reasons: Incorrect, SourcesMissingOrUnreliable, MissingKeyPoints, HardToUnderstand, ArgumentativeOrBiased, IrrelevantSources, OpinionSpeculation, and NoteNotNeeded\. These variables capture common failure modes for community notes\. Finally, $r_a$ is an auxiliary diagnostic signal measuring whether the note changes the reader’s understanding of the post\.

These dimensions are inspired by the Community Notes rating process\. In the Community Notes program, raters explain why a note is helpful or not helpful through predefined reason categories, including whether the note is clear, well\-sourced, incorrect, unnecessary, etc\. MultiCom adopts this structure by eliciting diverse reason\-level signals from simulated raters and using these features to augment aggregation\. This design preserves preference nuances: two notes may receive similar helpfulness votes while differing in other features such as source quality or claim coverage\. In our representation, $y_a$ and $\mathbf{s}_a$ capture the agent’s rating stance, $c_a$ captures confidence, $\mathbf{q}_a$ captures positive quality evidence, and $\mathbf{f}_a$ captures diagnostic failure modes\.
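One way to picture the structured judgment of Eq. (2) is as a validated record parsed from an agent's JSON reply. The sketch below is an assumption-laden illustration: the rating values and reason labels are taken from the text, but the class name `AgentJudgment`, the JSON field names, the helper `parse_judgment`, and the $[0, 1]$ confidence range are hypothetical and do not come from the paper.

```python
import json
from dataclasses import dataclass, field

RATINGS = ("helpful", "somewhat helpful", "not helpful")
QUALITY_REASONS = frozenset(
    {"Clear", "GoodSources", "AddressesClaim", "ImportantContext", "UnbiasedLanguage"}
)
FAILURE_REASONS = frozenset(
    {
        "Incorrect", "SourcesMissingOrUnreliable", "MissingKeyPoints", "HardToUnderstand",
        "ArgumentativeOrBiased", "IrrelevantSources", "OpinionSpeculation", "NoteNotNeeded",
    }
)


@dataclass
class AgentJudgment:
    rating: str                                           # y_a
    agree: bool                                           # part of the stance vector s_a
    disagree: bool                                        # part of the stance vector s_a
    confidence: float                                     # c_a; the [0, 1] range is assumed
    quality: frozenset = field(default_factory=frozenset)   # q_a
    failures: frozenset = field(default_factory=frozenset)  # f_a
    changes_understanding: bool = False                   # r_a

    def __post_init__(self):
        if self.rating not in RATINGS:
            raise ValueError(f"unknown rating: {self.rating!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        unknown = (set(self.quality) - QUALITY_REASONS) | (set(self.failures) - FAILURE_REASONS)
        if unknown:
            raise ValueError(f"unknown reason labels: {sorted(unknown)}")


def parse_judgment(raw_json):
    """Parse one agent reply; the JSON field names are hypothetical."""
    data = json.loads(raw_json)
    return AgentJudgment(
        rating=data["rating"],
        agree=bool(data["agree"]),
        disagree=bool(data["disagree"]),
        confidence=float(data["confidence"]),
        quality=frozenset(data.get("quality", [])),
        failures=frozenset(data.get("failures", [])),
        changes_understanding=bool(data.get("changes_understanding", False)),
    )


example = parse_judgment(
    '{"rating": "somewhat helpful", "agree": true, "disagree": false, '
    '"confidence": 0.7, "quality": ["Clear"], "failures": ["MissingKeyPoints"]}'
)
```

Rejecting unknown labels at parse time keeps malformed agent replies out of the aggregation features. Whether the original system validates replies this way is not stated in the text.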

### 3\.4 Calibrated Multi\-View Aggregation

After obtaining structured JSON outputs from the different persona agents, MultiCom groups them into several note\-level features, including raw vote distributions, confidence statistics, consistency signals, features describing the reasons for helpfulness or not\-helpfulness, cluster\-level disagreement patterns, and features derived from the metadata Liu et al\. \([2023](https://arxiv.org/html/2606.18268#bib.bib21)\); Hashemi et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib22)\); Ye et al\. \([2023](https://arxiv.org/html/2606.18268#bib.bib23)\)\. A complete list of feature views and out\-of\-fold predictors is provided in Appendix [D\.1](https://arxiv.org/html/2606.18268#A4.SS1)\. These features enable the aggregator to model how agents vote and why they vote that way\.

We then use an out\-of\-fold method to process all learned aggregation components Wolpert \([1992](https://arxiv.org/html/2606.18268#bib.bib24)\); Kaufman et al\. \([2012](https://arxiv.org/html/2606.18268#bib.bib25)\)\. Specifically, for each individual note, the intermediate predictions used by the final aggregator are generated by models that were not trained using that particular note\. This avoids over\-fitting\.
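The out-of-fold idea can be illustrated with a small generic helper: every note receives its intermediate prediction from a model fitted on the other folds only. This is a sketch under assumptions; the round-robin stratification and the `fit`/`predict` callables are placeholders chosen for the example, not the paper's actual components.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign a fold id to each index so every class is spread evenly over the folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k  # round-robin within each class
    return fold_of


def out_of_fold_predict(features, labels, fit, predict, k=5):
    """Return one prediction per item, each produced by a model that never saw that item."""
    fold_of = stratified_folds(labels, k)
    oof = [None] * len(labels)
    for fold in range(k):
        train = [i for i in range(len(labels)) if fold_of[i] != fold]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in range(len(labels)):
            if fold_of[i] == fold:
                oof[i] = predict(model, features[i])
    return oof


# Leakage check with a toy "model" that simply memorizes its training items.
toy_features = list(range(20))
toy_labels = ["NMR"] * 10 + ["H"] * 5 + ["NH"] * 5
seen_in_training = out_of_fold_predict(
    toy_features,
    toy_labels,
    fit=lambda X, y: set(X),
    predict=lambda model, x: x in model,
)
```

Every entry of `seen_in_training` is `False`, confirming that no item is scored by a model that was fitted on it.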

MultiCom finally integrates multiple complementary out\-of\-fold predictors, including weighted ensemble predictions Dietterich \([2000](https://arxiv.org/html/2606.18268#bib.bib26)\); Caruana et al\. \([2004](https://arxiv.org/html/2606.18268#bib.bib27)\), gated ensemble predictions, rescue\-gate predictions Jacobs et al\. \([1991](https://arxiv.org/html/2606.18268#bib.bib28)\), rationale blend predictions, and metadata predictions\. For a note $n$, each predictor $m$ produces a label $\hat{y}_{m,n} \in \{\texttt{NH}, \texttt{NMR}, \texttt{H}\}$\. The final class score is calculated as

$$S_c(n) = \sum_m w_m \mathbb{I}(\hat{y}_{m,n} = c), \tag{3}$$

where $w_m$ is the weight assigned to predictor $m$\. The final prediction is $\hat{y}_n = \arg\max_c S_c(n)$\.
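Eq. (3) is a weighted vote over predictor labels, which the following sketch spells out. The predictor names mirror those listed in the text, but the weights are made-up values for illustration, and the tie-breaking rule (first class in the tuple wins) is an assumption.

```python
def aggregate_votes(predictions, weights, classes=("NH", "NMR", "H")):
    """Compute S_c(n) = sum_m w_m * 1[y_hat_{m,n} = c] and return (argmax class, scores)."""
    scores = {c: 0.0 for c in classes}
    for predictor, label in predictions.items():
        scores[label] += weights[predictor]
    best = max(classes, key=lambda c: scores[c])  # ties go to the earlier class (assumption)
    return best, scores


# Hypothetical weights; the actual weights are not given in this section.
weights = {
    "weighted_ensemble": 0.30,
    "gated_ensemble": 0.25,
    "rescue_gate": 0.20,
    "rationale_blend": 0.15,
    "metadata": 0.10,
}
predictions = {
    "weighted_ensemble": "NMR",
    "gated_ensemble": "NMR",
    "rescue_gate": "H",
    "rationale_blend": "H",
    "metadata": "H",
}
label, scores = aggregate_votes(predictions, weights)
```

In this example three of the five predictors vote `H`, yet `NMR` wins because the two `NMR` voters carry more total weight (0.55 against 0.45).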

Furthermore, in instances where the initial prediction from the ensemble model is NMR, we employ a conservative upgrading rule\. Specifically, if two auxiliary out\-of\-fold predictors consistently predict the same resolved label \(i\.e\., H or NH\), and the diagnostic statistics at the voting level satisfy a preset threshold, we upgrade the prediction result from NMR to that resolved label\. Detailed information regarding the auxiliary predictors, their inputs, and the upgrading thresholds is in Appendix [D\.1](https://arxiv.org/html/2606.18268#A4.SS1)\.
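The conservative upgrading rule reads naturally as a small guard function. In the sketch below, the single `vote_margin` statistic and its `min_margin` threshold are hypothetical stand-ins for the vote-level diagnostic statistics and preset thresholds that the paper defers to its appendix.

```python
RESOLVED = ("H", "NH")


def upgrade_nmr(ensemble_label, aux_label_a, aux_label_b, vote_margin, min_margin=0.6):
    """Upgrade an NMR prediction only when both auxiliary predictors agree on a
    resolved label and the vote-level diagnostic clears the threshold.

    vote_margin and min_margin are illustrative placeholders.
    """
    if ensemble_label != "NMR":
        return ensemble_label  # the rule only ever touches NMR predictions
    agree_on_resolved = aux_label_a == aux_label_b and aux_label_a in RESOLVED
    if agree_on_resolved and vote_margin >= min_margin:
        return aux_label_a
    return "NMR"


upgraded = upgrade_nmr("NMR", "H", "H", vote_margin=0.8)      # both conditions hold
kept_split = upgrade_nmr("NMR", "H", "NH", vote_margin=0.9)   # auxiliaries disagree
kept_weak = upgrade_nmr("NMR", "NH", "NH", vote_margin=0.3)   # diagnostic too weak
```

The rule is one-directional: it can move a note out of `NMR`, but it never overrides a resolved label already chosen by the ensemble.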

## 4 Experiments

### 4\.1 Methods

We compare MultiCom against methods that predict helpfulness directly\. We exclude other multi\-agent systems Wu et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib35)\); Park et al\. \([2023](https://arxiv.org/html/2606.18268#bib.bib36)\) due to their similarity to MultiCom’s ablation settings, and omit fact\-checking systems Wang et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib1)\) as their tasks diverge from helpfulness prediction:

- Single Agent: This setting explores whether one agent alone is sufficient to substitute for the multi\-agent simulation process\. It receives the same post and note as MultiCom, and is tasked with generating the same set of ratings as an agent in MultiCom, including overall helpfulness status, confidence, agreement signals, helpfulness reasons, not\-helpfulness reasons, and diagnostic signals\. Unlike MultiCom, this baseline employs no persona prompts derived from rater clusters\. We feed the outputs generated by this single agent into an out\-of\-fold calibration model identical to that of MultiCom, ensuring a fair comparison\.

- Fine\-tuned Model: Following prior work Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\), we use Mistral\-7B\-Instruct\-v0\.3 as the backbone model and fine\-tune it with LoRA Hu et al\. \([2022](https://arxiv.org/html/2606.18268#bib.bib30)\); Nguyen et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib29)\); Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\) on the ComRate training set\. Following Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\), each input instance consists of the post text, the community note text, and a classification instruction \(predict H/NH/NMR\)\. Additional implementation details are provided in Appendix [D\.3](https://arxiv.org/html/2606.18268#A4.SS3)\.

- MultiCom: This setting is detailed in Sec\. [3](https://arxiv.org/html/2606.18268#S3)\.

Table 1: Performance comparison \(accuracy, balanced accuracy, and Macro\-F1\) across different methods on the ComRate dataset\. Subscripts denote binomial standard errors Basharat et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib31)\)\.

### 4\.2 Evaluation Process

We evaluated all methods on the ComRate evaluation set, which contains 2,000 samples drawn by stratified sampling on the notes’ creation year, topic, and final status\. Each instance includes a post, a community note, and a ground\-truth label corresponding to its official status: H, NH, or NMR\. The NMR category is the majority class, representing 88\.75% of the evaluation dataset\.
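The binomial standard errors mentioned in the Table 1 caption follow the textbook formula $\sqrt{p(1-p)/n}$. As a rough sanity check, the sketch below evaluates it for the reported 84.7% accuracy, assuming the error is computed over all $n = 2{,}000$ evaluation notes. That sample size is an assumption, since the paper does not spell out the computation in this section.

```python
import math


def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Reported accuracy of 84.7%; n = 2,000 is assumed to be the full evaluation set.
se = binomial_standard_error(0.847, 2000)
```

Under these assumptions the standard error is about 0.008, or roughly 0.8 percentage points.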

All trainable components, including fine\-tuned models and calibrated aggregation models, were evaluated using a 5\-fold stratified out\-of\-fold methodology\. In each fold, 80% of the data was used for training, and the remaining 20% served as the held\-out test set\. Final metrics were calculated by aggregating predictions across all five held\-out sets, ensuring no model was tested on its training data\.

Given the class imbalance, we evaluate performance using accuracy, balanced accuracy \($\frac{1}{3}\sum_{i=1}^{3}\text{Recall}_i$\), and macro\-F1 \($\frac{1}{3}\sum_{i=1}^{3}\text{F1}_i$\)\.
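To see why accuracy alone is misleading under this imbalance, the sketch below implements the three metrics and applies them to a trivial predictor that always outputs `NMR`. The 1,775 `NMR` notes follow from 88.75% of 2,000; the split of the remaining 225 notes into 125 `H` and 100 `NH` is hypothetical and affects none of the three results.

```python
def confusion_counts(y_true, y_pred, classes):
    """Per-class [true positives, false positives, false negatives]."""
    counts = {c: [0, 0, 0] for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # predicted class gains a false positive
            counts[t][2] += 1  # true class gains a false negative
    return counts


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, classes):
    """Mean of per-class recall; a class with no true instances contributes 0."""
    counts = confusion_counts(y_true, y_pred, classes)
    recalls = [tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in counts.values()]
    return sum(recalls) / len(classes)


def macro_f1(y_true, y_pred, classes):
    """Mean of per-class F1, with F1 defined as 0 when a class has no tp, fp, or fn."""
    counts = confusion_counts(y_true, y_pred, classes)
    f1s = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(classes)


CLASSES = ("H", "NH", "NMR")
y_true = ["NMR"] * 1775 + ["H"] * 125 + ["NH"] * 100  # H/NH split is hypothetical
y_pred = ["NMR"] * 2000                                # always-NMR predictor

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred, CLASSES)
mf1 = macro_f1(y_true, y_pred, CLASSES)
```

The always-`NMR` predictor reaches 88.75% accuracy, but its balanced accuracy is only one third and its macro-F1 is about 0.31, which is why the two class-averaged metrics are the more informative ones here.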

### 4\.3 Parameters

In MultiCom, the number of agents determines the volume of simulated judgments\. Our primary setup uses 16 persona agents, each mapping to a distinct rater cluster derived from matrix factorization\. When scaling to 32 or 48 agents, we retain the original 16 cluster profiles and assign multiple independent replicas to each\. The replicas mimic the randomness of each persona agent, which improves rating diversity\. For calibrated aggregation, we use nested cross\-validation\. The outer loop uses 5\-fold stratified cross\-validation for final evaluation\. Within each outer training split, we use inner 5\-fold cross\-validation to set the logistic regularization strength, class weights, and decision thresholds Adekoya et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib20)\); Stenhouse et al\. \([2021](https://arxiv.org/html/2606.18268#bib.bib19)\)\.

## 5 Results

### 5\.1 Main Study

Table [1](https://arxiv.org/html/2606.18268#S4.T1) summarizes the performance\. MultiCom achieves the best balanced accuracy of 68\.3% and Macro\-F1 of 60\.1%, while maintaining high overall accuracy of 84\.7%\. The fine\-tuned LoRA baseline performs substantially worse than MultiCom, especially in balanced accuracy and Macro\-F1, indicating that direct text classification struggles with the three\-way helpfulness task under severe class imbalance\.

A category\-level analysis reveals that the fine\-tuned model has lower accuracy on specific categories, such as “Satire” \(43\.8%\) and “Outdated information” \(52\.1%\)\. This points to the limitations of relying solely on text classification for community note evaluation without crowdsourced reasoning\. Conversely, MultiCom maintains stronger and more balanced performance across diverse factual contexts, achieving high accuracy on “Not misleading” \(90\.9%\), “Factual error” \(84\.1%\), and “Missing important context” notes \(84\.1%\)\. To address class imbalance, we further constructed a balanced evaluation set of 1,998 notes \(666 per class\) while preserving the natural distribution of other features\. On this balanced set, MultiCom consistently outperforms alternative methods\. For instance, MultiCom achieves an accuracy, balanced accuracy, and macro\-F1 of 68\.0%, 68\.0%, and 67\.3%, whereas the single\-agent baseline only yields 49\.3%, 49\.2%, and 46\.2%, respectively \(see Appendix [C\.3](https://arxiv.org/html/2606.18268#A3.SS3) for details\)\.

### 5\.2 Ablation Study

We conduct an ablation study to compare MultiCom with two ablated variants: MultiCom w/o Cluster, which removes cluster\-grounded persona profiles while preserving the structured output schema, and MultiCom w/o MultiDim, which removes multi\-dimensional diagnostic signals and relies only on agents’ helpfulness votes\.

As shown in Table [1](https://arxiv.org/html/2606.18268#S4.T1), MultiCom achieves the best overall performance, with an average accuracy of 84\.7%, balanced accuracy of 68\.3%, and Macro\-F1 of 60\.1%\. Removing rater\-grounded clustering reduces average accuracy from 84\.7% to 80\.5%, while balanced accuracy and Macro\-F1 drop sharply from 68\.3% to 38\.8% and from 60\.1% to 38\.3%, respectively\.

Removing the multi\-dimensional diagnostic mechanism leads to more severe degradation in accuracy and Macro\-F1\. MultiCom w/o MultiDim achieves only 60\.4% average accuracy, 46\.1% balanced accuracy, and 34\.6% Macro\-F1\. Compared to MultiCom, these metrics drop by 24\.3, 22\.2, and 25\.5 percentage points, respectively\. The performance decline is particularly acute for categories such as Not misleading, where accuracy decreases from 90\.9% to 37\.4%\. These results show that diagnostic dimensions, such as evidence quality and claim coverage, provide important signals for classification\.

### 5\.3 Generalizability Study

As detailed in Table [2](https://arxiv.org/html/2606.18268#S5.T2), MultiCom generalizes across backbone models and across time\. For models, we use the same evaluation set of 2,000 notes and vary the backbone model \(e\.g\., Qwen, Claude\)\. For time, we employ a rolling future\-prediction method, predicting note helpfulness for the following 1\-3 years based on the prior 1\-3 years’ windows \(see Appendix [D\.4](https://arxiv.org/html/2606.18268#A4.SS4) for details and justification\)\. All feasible temporal windows are averaged when reporting accuracies\. MultiCom achieved peak average accuracies of 73\.1% and 71\.5% using DeepSeek\-v3\.2 and GPT\-5\.1, respectively\. In multi\-year settings, it reaches accuracies between 84\.9% and 87\.1% across one\- to three\-year forecasting windows, indicating minimal performance degradation over time\.

Table 2: Accuracy across different generalizability settings\. Subscripts denote binomial standard errors\.

Figure [3](https://arxiv.org/html/2606.18268#S5.F3) evaluates whether MultiCom remains effective with different numbers of persona agents\. Here, we retain the same set of 16 cluster persona profiles and assign multiple independent agents to each cluster: 2 for the 32\-agent configuration and 3 for the 48\-agent configuration\. Agents belonging to the same cluster share an identical cluster\-specific prompt, yet process queries independently\. This setting aims to assess the impact of conducting repeated, independent simulations within each evaluation group\. Results show that the 32\-agent and 48\-agent settings achieve higher overall accuracy than the 16\-agent setting\. However, balanced accuracy degrades: more agents cause MultiCom to behave conservatively when handling the NMR category, reducing recall for H or NH\.

![Refer to caption](https://arxiv.org/html/2606.18268v1/x4.png) Figure 3: Accuracy and precision across different agent numbers, where “1 agent\*” is the baseline without persona simulation\.

### 5\.4 Agent\-Human Alignment

To evaluate structural alignment between agent and human rating patterns, we conducted a Representational Similarity Analysis \(RSA\) across the 16 clusters, a standard paradigm for human\-AI alignment evaluation Sucholutsky et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib32)\); Wynn et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib33)\)\. Representational Dissimilarity Matrices \(RDMs\) for both humans and agents were constructed by computing the absolute difference between mean note\-level scores for each cluster pair, restricted to mutually evaluated notes\. We then calculated the RSA score as the Spearman correlation between the upper\-triangular entries of the aligned RDMs\. A label\-permutation test \(2,000 iterations\) confirmed significant structural alignment \($r=0.459$, $p=.0015$; see Figure [4](https://arxiv.org/html/2606.18268#S5.F4)\)\. Additionally, we tested its generalizability across models, finding significant RSA for all of them \(e\.g\., Claude: $r=0.714$, $p=.0005$; Qwen: $r=0.329$, $p=.0135$\)\. We further tested its generalizability across time, finding that agent simulations exhibit structures consistent with human clusters even when predicting future data \(Pred year: $r=0.376$, $p=.0015$; Pred 2Years: $r=0.376$, $p=.002$; Pred 3Years: $r=0.376$, $p=.0015$; see Appendix [C\.2](https://arxiv.org/html/2606.18268#A3.SS2) for details\)\.
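The RSA procedure can be sketched as follows. This is one possible reading of the description, not the authors' code: the RDM entry is taken as the absolute difference between two clusters' mean scores over their shared notes, the permutation test shuffles the cluster labels of the agent RDM, and the one-sided p-value uses the common $(\text{hits}+1)/(n_{\text{perm}}+1)$ correction. All function names and the toy scores are hypothetical.

```python
import random
from itertools import combinations


def build_rdm(cluster_scores):
    """cluster_scores: one dict per cluster, mapping note id to mean score."""
    k = len(cluster_scores)
    rdm = [[0.0] * k for _ in range(k)]
    for a, b in combinations(range(k), 2):
        shared = cluster_scores[a].keys() & cluster_scores[b].keys()
        if not shared:
            raise ValueError(f"clusters {a} and {b} share no notes")
        mean_a = sum(cluster_scores[a][n] for n in shared) / len(shared)
        mean_b = sum(cluster_scores[b][n] for n in shared) / len(shared)
        rdm[a][b] = rdm[b][a] = abs(mean_a - mean_b)
    return rdm


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for t in range(pos, end + 1):
            ranks[order[t]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) for constant inputs


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def upper_triangle(rdm, order):
    """Upper-triangular entries after relabeling clusters according to `order`."""
    return [rdm[order[a]][order[b]] for a, b in combinations(range(len(order)), 2)]


def rsa_permutation_test(human_rdm, agent_rdm, n_perm=2000, seed=0):
    identity = list(range(len(human_rdm)))
    human_vec = upper_triangle(human_rdm, identity)
    observed = spearman(human_vec, upper_triangle(agent_rdm, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # shuffle the agent clusters' labels
        if spearman(human_vec, upper_triangle(agent_rdm, perm)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy example with 5 clusters that all rated the same single note.
human_scores = [{"n": s} for s in (0.0, 0.1, 0.3, 0.7, 1.5)]
agent_scores = [{"n": s} for s in (0.2, 0.1, 0.5, 0.9, 1.4)]
rsa, p_value = rsa_permutation_test(build_rdm(human_scores), build_rdm(agent_scores), n_perm=500)
```

With 16 clusters the upper triangle holds 120 pairwise entries, so the correlation is computed over 120 value pairs per RDM comparison.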

![Refer to caption](https://arxiv.org/html/2606.18268v1/Figure/rsa_0459_reconstruction_fig_distance.png) Figure 4: Representational alignment between humans’ ratings and agents’ ratings\.

### 5\.5 Computational Cost and Latency

MultiCom’s computational cost comes from LLM\-based agent simulation\. The aggregation stage requires no additional LLM usage and only involves feature construction and logistic\-regression inference\. GPT\-5\.4\-nano is priced at $0\.20/$1\.25 per million input/output tokens; at this price, MultiCom costs $0\.0076 per note on average\. Across all models we benchmarked, MultiCom costs between $0\.0075 and $0\.170 per note, suggesting its economic viability\.
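The per-note cost is simple token arithmetic: input and output token counts multiplied by their respective prices. The sketch below uses the quoted prices, but the per-agent token counts are hypothetical values picked only so that the arithmetic lands near the reported figure. The section does not state the actual token usage.

```python
def cost_per_note(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD for one note, given token counts and prices per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


N_AGENTS = 16
INPUT_TOKENS_PER_AGENT = 1_500   # hypothetical prompt size (persona + post + note)
OUTPUT_TOKENS_PER_AGENT = 140    # hypothetical size of one structured judgment

example_cost = cost_per_note(
    N_AGENTS * INPUT_TOKENS_PER_AGENT,
    N_AGENTS * OUTPUT_TOKENS_PER_AGENT,
    usd_per_m_input=0.20,
    usd_per_m_output=1.25,
)
```

With these assumed counts, 24,000 input tokens cost $0.0048 and 2,240 output tokens cost $0.0028, for a total of $0.0076 per note.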

### 5\.6 Robustness on Unstable Notes

We further examined MultiCom’s robustness on notes with status\-transition signals and on notes with or without author\-indicated trustworthy sources\. Status\-transition notes obtain lower accuracy than the full evaluation set \(75\.7% vs\. 84\.7%\), suggesting that volatile notes are more ambiguous\. The trustworthy\-source split shows that MultiCom remains stable even for notes without trustworthy sources, which reach an even higher accuracy \(87\.8%\)\. Full results are provided in Appendix [C\.1](https://arxiv.org/html/2606.18268#A3.SS1)\.

### 5\.7 Effects of NMR Ratio

Because NMR is the majority class in ComRate, we explore how the predicted NMR ratio affects MultiCom’s performance by varying the decision threshold\. The ground\-truth ratio in the dataset is 88\.75%\. As depicted in Figure [5](https://arxiv.org/html/2606.18268#S5.F5), we found a trade\-off between balanced accuracy and precision for the H and NH classes\. Increasing the predicted NMR ratio improves overall accuracy due to class imbalance and enhances the precision of H and NH, but at the cost of balanced accuracy\. Conversely, lowering the ratio toward 70\.00% improves balanced accuracy but diminishes the precision of H and NH\. This suggests that the NMR threshold could function as an adjustment parameter, where conservative abstention leads to more reliable predictions\.
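The relationship between a decision threshold and the predicted NMR ratio can be illustrated with a simple abstention rule. The exact form of MultiCom's threshold is not specified in this section, so the rule below is an assumed stand-in: a resolved label is emitted only when its probability reaches the threshold, and the toy probabilities are made up.

```python
def predict_with_abstention(resolved_probs, threshold):
    """resolved_probs: list of (p_helpful, p_not_helpful) pairs, one per note.

    Emit H or NH only when the stronger resolved probability reaches the
    threshold; otherwise abstain with NMR. This rule is an illustrative assumption.
    """
    labels = []
    for p_h, p_nh in resolved_probs:
        best_label, best_p = ("H", p_h) if p_h >= p_nh else ("NH", p_nh)
        labels.append(best_label if best_p >= threshold else "NMR")
    return labels


def nmr_ratio(labels):
    return labels.count("NMR") / len(labels)


toy_probs = [(0.9, 0.05), (0.6, 0.1), (0.2, 0.7), (0.3, 0.3), (0.05, 0.1)]
ratios = {t: nmr_ratio(predict_with_abstention(toy_probs, t)) for t in (0.5, 0.65, 0.8)}
```

Raising the threshold from 0.5 to 0.8 moves the predicted NMR ratio on this toy input from 0.4 to 0.8. Only the most confident resolved predictions survive, which is the mechanism behind the precision gain and recall loss described above.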

![Refer to caption](https://arxiv.org/html/2606.18268v1/x5.png) Figure 5: Accuracy and resolved\-class precision across varied predicted NMR ratios\.

Table 3: Per\-label and overall multi\-label prediction performance for community notes reason labels\. The H/NH subset includes only notes whose Helpful/Not Helpful status is correctly predicted by MultiCom\.

### 5\.8 Fine\-grained Reasons Prediction

To provide fine\-grained prediction rationales, similar to Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\), we examined the accuracy of predicting fine\-grained reason labels, including four H labels and five NH labels adopted from the Community Notes program \(details in Appendix [D\.5](https://arxiv.org/html/2606.18268#A4.SS5)\)\. Table [3](https://arxiv.org/html/2606.18268#S5.T3) reports the reason\-prediction results\. Features extracted by MultiCom achieve a Micro\-F1, Macro\-F1, and Sample\-F1 of 54\.3%, 42\.6%, and 53\.3%, respectively\. We further report results on the subset where MultiCom correctly predicts the H/NH statuses, thereby isolating label prediction errors\. In this subset, Micro\-F1 increases to 56\.2% while Sample\-F1 reaches 55\.9%\. Among the reason\-level metrics, high\-frequency dimensions such as Clear, AddressesClaim, and ImportantContext are predicted most accurately, indicating MultiCom’s effectiveness for status prediction\.

## 6 Discussion

Grounded simulation\. Single\-agent evaluations often collapse into artificial consensus, especially for evaluations needing multiple perspectives\. MultiCom anchors persona agents in a matrix\-factorized space derived from real\-world rating behaviors, preserving the genuine heterogeneity and ideological disagreements of human raters\.

Trade\-offs\. As analyzed, the predicted NMR ratio could serve as a lever with which platforms dynamically balance the trade\-off between abstention and H/NH recall\. For example, platforms can enforce stricter thresholds during breaking news events to avoid premature judgments\.

Explainability\. Beyond binary classification, MultiCom yields actionable diagnostic feedback detailing specific failure modes and evidence quality\. This transparency supports human\-in\-the\-loop systems, allowing users to verify the underlying reasoning prior to formalized consensus\.

Applications\. MultiCom could be used to evaluate note helpfulness for platform deployment and to guide contributors in revisions via multi\-dimensional feedback\. These interpretable scores can assist novice fact\-checkers and function as reward signals for AI\-generated notes, such as for $\mathbb{X}$’s “AI Note Writer” or “Collaborative Note”\.

## 7 Related Work

Community\-based fact\-checking\. The rapid online spread of misinformation has become a critical societal concern Scheufele and Krause \([2019](https://arxiv.org/html/2606.18268#bib.bib7)\)\. In response, community\-based fact\-checking emerged, leveraging the crowd’s intelligence to detect misinformation Pröllochs \([2022](https://arxiv.org/html/2606.18268#bib.bib8)\); Borenstein et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib12)\) and to write structured notes that deter its spread Chuai et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib9)\)\. Such notes have been shown to be highly effective in countering misinformation Chuai et al\. \([2026b](https://arxiv.org/html/2606.18268#bib.bib10)\), while also being known to have resilience issues Chuai et al\. \([2026a](https://arxiv.org/html/2606.18268#bib.bib11)\)\. Many recent initiatives have begun to use AI to automate community\-based fact\-checking processes De et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib13)\); Zhang et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib14), [2026](https://arxiv.org/html/2606.18268#bib.bib37)\), especially in specific domains Wu et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib15)\)\.

Helpfulness evaluation\. Verifying the accuracy of fact\-checks is essential Wang et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib1)\)\. This could be achieved through cross\-checking references Smeros et al\. \([2021](https://arxiv.org/html/2606.18268#bib.bib2)\)\. For fact\-checkers, this also involves scrutiny across organizations Warren et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib3)\); Juneja and Mitra \([2022](https://arxiv.org/html/2606.18268#bib.bib4)\)\. More importantly, because debunking texts authored by crowds or AI can be inconsistent, evaluating their helpfulness remains crucial for improving fact\-checking capabilities Nakov et al\. \([2021](https://arxiv.org/html/2606.18268#bib.bib5)\)\. While early work proposed fine\-tuning models to assess community notes’ helpfulness, it overlooks the NMR status and ignores notes’ temporal dynamics, where newer notes contain updated context unavailable to older ones\.

## 8 Conclusion

This paper introduces MultiCom, a persona\-guided multi\-agent framework designed to evaluate crowdsourced debunking notes\. We construct ComRate, a large\-scale, real\-world rating dataset based on $\mathbb{X}$’s Community Notes\. By modeling empirical behavioral heterogeneity and eliciting multi\-dimensional diagnostic judgments, MultiCom accurately simulates human rater consensus\. Extensive evaluations show that our framework significantly outperforms alternative baselines, providing a scalable, accurate, and explainable solution for automated content governance\.

## 9 Limitations

This paper conducted experiments primarily on the Community Notes dataset\. The effectiveness of multi\-agent simulations has not been validated across other content governance platforms \(e\.g\., Meta, YouTube\), which may possess different user demographics, moderation architectures, and interface constraints\. In addition, the current dataset and evaluation pipeline are primarily centered on English\-language interactions and Western\-centric misinformation contexts\. Future work could expand the scope to general fact\-checking datasets and contexts\.

## 10 Ethical considerations

This paper uses the ComRate dataset, which was constructed using publicly available data or data from the $\mathbb{X}$ API\. The data collection process strictly adheres to the platform’s terms of service and data use guidelines\. Because the official Community Notes ecosystem anonymizes contributor identities by design, our dataset and subsequent multi\-agent simulations do not compromise personally identifiable information or individual user privacy\.

A key ethical consideration in automated fact\-checking is the potential for algorithmic bias\. To mitigate the risk of artificial consensus often found in single\-agent evaluations, MultiCom anchors its persona agents in a matrix\-factorized space derived from real\-world rating behaviors\. While this design intentionally preserves the heterogeneity and nuanced ideological disagreements of human raters, we acknowledge that the underlying LLMs driving the agents may still carry biases from their pre\-training corpora\. To counterbalance this, MultiCom relies on multi\-dimensional diagnostic judgments instead of binary classifications\. This potentially improves evidence quality and enables investigation of potential failure modes\.

Finally, we emphasize the potential dual\-use nature of this framework\. MultiCom is designed to support, rather than replace, human\-driven content moderation\. It could empower human users to verify the simulated reasoning prior to reaching a formalized consensus, rather than to discredit specific notes or fact\-checkers\. By providing feedback such as helpfulness ratings and reasons, MultiCom is developed to foster a healthier information ecosystem\.

## References

- A\. Adekoya, F\. Saeed, W\. Ghaban, and S\. N\. Qasem \(2025\)Ensemble learning approach with explainable ai for improved heart disease prediction\.Frontiers in Pharmacology16,pp\. 1654681\.Cited by:[§4\.3](https://arxiv.org/html/2606.18268#S4.SS3.p1.1)\.
- H\. Basharat, S\. Plotkin, C\. Le, K\. Zhu, M\. Pink, and I\. Alfaro \(2025\)VariantBench: a framework for evaluating llms on justifications for genetic variant interpretation\.InThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics,pp\. 314–321\.Cited by:[Table 1](https://arxiv.org/html/2606.18268#S4.T1)\.
- N\. Borenstein, G\. Warren, D\. Elliott, and I\. Augenstein \(2025\)Can community notes replace professional fact\-checkers?\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 535–552\.Cited by:[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- R\. Caruana, A\. Niculescu\-Mizil, G\. Crew, and A\. Ksikes \(2004\)Ensemble selection from libraries of models\.InProceedings of the twenty\-first international conference on Machine learning,pp\. 18\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p3.3)\.
- Y\. Chuai, G\. Lenzini, and N\. Pröllochs \(2026a\)Consensus stability of community notes on x\.InProceedings of the ACM Web Conference 2026,pp\. 8885–8896\.Cited by:[§C\.1](https://arxiv.org/html/2606.18268#A3.SS1.p1.1),[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- Y\. Chuai, M\. Pilarski, T\. Renault, D\. Restrepo\-Amariles, A\. Troussel\-Clément, G\. Lenzini, and N\. Pröllochs \(2026b\)Community\-based fact\-checking reduces the spread of misleading posts on x \(formerly twitter\)\.Nature Communications17\(1\),pp\. 4070\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p2.1),[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- Y\. Chuai, H\. Tian, N\. Pröllochs, and G\. Lenzini \(2024\)Did the roll\-out of community notes reduce engagement with misinformation on x/twitter?\.Proceedings of the ACM on human\-computer interaction8\(CSCW2\),pp\. 1–52\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p1.1),[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- Y\. Chuai, S\. Zhang, Z\. Wang, X\. Yi, M\. Mosleh, and G\. Lenzini \(2026c\)Request a note: how the request function shapes x’s community notes system\.InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems,pp\. 1–22\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p1.1)\.
- S\. De, M\. A\. Bakker, J\. Baxter, and M\. Saveski \(2025\)Supernotes: driving consensus in crowd\-sourced fact\-checking\.InProceedings of the ACM on Web Conference 2025,pp\. 3751–3761\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p3.1),[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- T\. G\. Dietterich \(2000\)Ensemble methods in machine learning\.InInternational workshop on multiple classifier systems,pp\. 1–15\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p3.3)\.
- Z\. Guo, M\. Schlichtkrull, and A\. Vlachos \(2022\)A survey on automated fact\-checking\.Transactions of the association for computational linguistics10,pp\. 178–206\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p1.1),[§1](https://arxiv.org/html/2606.18268#S1.p2.1),[§1](https://arxiv.org/html/2606.18268#S1.p3.1)\.
- H\. Hashemi, J\. Eisner, C\. Rosset, B\. Van Durme, and C\. Kedzie \(2024\)Llm\-rubric: a multidimensional, calibrated approach to automated evaluation of natural language texts\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13806–13834\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p1.1)\.
- E\. J\. Hu, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,Cited by:[§D\.3](https://arxiv.org/html/2606.18268#A4.SS3.p1.1),[§4\.1](https://arxiv.org/html/2606.18268#S4.SS1.p3.1)\.
- R\. A\. Jacobs, M\. I\. Jordan, S\. J\. Nowlan, and G\. E\. Hinton \(1991\)Adaptive mixtures of local experts\.Neural computation3\(1\),pp\. 79–87\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p3.3)\.
- P\. Juneja and T\. Mitra \(2022\)Human and technological infrastructures of fact\-checking\.Proceedings of the ACM on Human\-Computer Interaction6\(CSCW2\),pp\. 1–36\.Cited by:[§7](https://arxiv.org/html/2606.18268#S7.p2.1)\.
- S\. Kaufman, S\. Rosset, C\. Perlich, and O\. Stitelman \(2012\)Leakage in data mining: formulation, detection, and avoidance\.ACM Transactions on Knowledge Discovery from Data \(TKDD\)6\(4\),pp\. 1–21\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p2.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\)G\-eval: nlg evaluation using gpt\-4 with better human alignment\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 2511–2522\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p1.1)\.
- N\. Micallef, V\. Armacost, N\. Memon, and S\. Patil \(2022\)True or false: studying the work practices of professional fact\-checkers\.Proceedings of the ACM on Human\-Computer Interaction6\(CSCW1\),pp\. 1–44\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p1.1),[§1](https://arxiv.org/html/2606.18268#S1.p2.1)\.
- P\. Nakov, D\. Corney, M\. Hasanain, F\. Alam, T\. Elsayed, A\. Barron\-Cedeno, P\. Papotti, S\. Shaar, G\. Da San Martino,et al\.\(2021\)Automated fact\-checking for assisting human fact\-checkers\.InIJCAI,pp\. 4551–4558\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p2.1),[§1](https://arxiv.org/html/2606.18268#S1.p3.1),[§7](https://arxiv.org/html/2606.18268#S7.p2.1)\.
- V\. Nguyen, H\. Nguyen, D\. Vu,et al\.\(2026\)Parameter\-efficient fine\-tuning of small language models for code generation: a comparative study of gemma, qwen 2\.5 and llama 3\.2\.\.International Journal of Electrical & Computer Engineering \(2088\-8708\)16\(1\)\.Cited by:[§4\.1](https://arxiv.org/html/2606.18268#S4.SS1.p3.1)\.
- J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th annual acm symposium on user interface software and technology,pp\. 1–22\.Cited by:[§4\.1](https://arxiv.org/html/2606.18268#S4.SS1.p1.1)\.
- N\. Pröllochs \(2022\)Community\-based fact\-checking on twitter’s birdwatch platform\.InProceedings of the International AAAI Conference on Web and Social Media,Vol\.16,pp\. 794–805\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p1.1),[§1](https://arxiv.org/html/2606.18268#S1.p2.1),[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- D\. A\. Scheufele and N\. M\. Krause \(2019\)Science audiences, misinformation, and fake news\.Proceedings of the National Academy of Sciences116\(16\),pp\. 7662–7669\.Cited by:[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- P\. Smeros, C\. Castillo, and K\. Aberer \(2021\)Sciclops: detecting and contextualizing scientific claims for assisting manual fact\-checking\.InProceedings of the 30th ACM international conference on information & knowledge management,pp\. 1692–1702\.Cited by:[§7](https://arxiv.org/html/2606.18268#S7.p2.1)\.
- K\. Stenhouse, M\. Roumeliotis, P\. Ciunkiewicz, R\. Banerjee, S\. Yanushkevich, and P\. McGeachy \(2021\)Development of a machine learning model for optimal applicator selection in high\-dose\-rate cervical brachytherapy\.Frontiers in Oncology11,pp\. 611437\.Cited by:[§4\.3](https://arxiv.org/html/2606.18268#S4.SS3.p1.1)\.
- I\. Sucholutsky, L\. Muttenthaler, A\. Weller, A\. Peng, A\. Bobu, B\. Kim, B\. C\. Love, C\. J\. Cueva, E\. Grant, I\. Groen,et al\.\(2025\)Getting aligned on representational alignment\.Transactions on Machine Learning Research2025\.Cited by:[§C\.2](https://arxiv.org/html/2606.18268#A3.SS2.p1.2),[§5\.4](https://arxiv.org/html/2606.18268#S5.SS4.p1.12)\.
- Y\. Wang, R\. G\. Reddy, Z\. M\. Mujahid, A\. Arora, A\. Rubashevskii, J\. Geng, O\. M\. Afzal, L\. Pan, N\. Borenstein, A\. Pillai,et al\.\(2024\)Factcheck\-bench: fine\-grained evaluation benchmark for automatic fact\-checkers\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 14199–14230\.Cited by:[§4\.1](https://arxiv.org/html/2606.18268#S4.SS1.p1.1),[§7](https://arxiv.org/html/2606.18268#S7.p2.1)\.
- G\. Warren, I\. Shklovski, and I\. Augenstein \(2025\)Show me the work: fact\-checkers’ requirements for explainable automated fact\-checking\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,pp\. 1–21\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p2.1),[§7](https://arxiv.org/html/2606.18268#S7.p2.1)\.
- D\. H\. Wolpert \(1992\)Stacked generalization\.Neural networks5\(2\),pp\. 241–259\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p2.1)\.
- J\. Wu, Z\. Fu, H\. Wang, F\. Li, J\. Guo, P\. Nakov, and M\. Kan \(2025\)Beyond the crowd: llm\-augmented community notes for governing health misinformation\.arXiv preprint arXiv:2510\.11423\.Cited by:[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu,et al\.\(2024\)Autogen: enabling next\-gen llm applications via multi\-agent conversations\.InFirst conference on language modeling,Cited by:[§4\.1](https://arxiv.org/html/2606.18268#S4.SS1.p1.1)\.
- A\. H\. Wynn, I\. Sucholutsky, and T\. L\. Griffiths \(2024\)Learning human\-like representations to enable learning human values\.Advances in Neural Information Processing Systems37,pp\. 30230–30260\.Cited by:[§C\.2](https://arxiv.org/html/2606.18268#A3.SS2.p1.2),[§5\.4](https://arxiv.org/html/2606.18268#S5.SS4.p1.12)\.
- R\. Xing, P\. Nakov, T\. Baldwin, and J\. H\. Lau \(2026\)COMMUNITYNOTES: a dataset for exploring the helpfulness of fact\-checking explanations\.InFindings of the Association for Computational Linguistics: EACL 2026,pp\. 1390–1411\.Cited by:[§D\.3](https://arxiv.org/html/2606.18268#A4.SS3.p2.1),[§1](https://arxiv.org/html/2606.18268#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.18268#S4.SS1.p3.1),[§5\.8](https://arxiv.org/html/2606.18268#S5.SS8.p1.1)\.
- S\. Ye, D\. Kim, S\. Kim, H\. Hwang, S\. Kim, Y\. Jo, J\. Thorne, J\. Kim, and M\. Seo \(2023\)Flask: fine\-grained language model evaluation based on alignment skill sets\.arXiv preprint arXiv:2307\.10928\.Cited by:[§3\.4](https://arxiv.org/html/2606.18268#S3.SS4.p1.1)\.
- S\. Zhang, L\. Wang, S\. Li, Y\. Wu, Y\. Chuai, L\. Chen, X\. Yi, and H\. Li \(2026\)Collab: fostering critical identification of deepfake videos on social media via synergistic annotation\.InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems,pp\. 1–21\.Cited by:[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.
- S\. Zhang, L\. Wang, D\. Shi, Y\. Chuai, J\. Chen, Y\. Chen, Y\. Wang, Y\. Wang, X\. Yi, and H\. Li \(2025\)Commenotes: synthesizing organic comments to support community\-based fact\-checking\.arXiv preprint arXiv:2509\.11052\.Cited by:[§1](https://arxiv.org/html/2606.18268#S1.p3.1),[§7](https://arxiv.org/html/2606.18268#S7.p1.1)\.

## Appendix A Generative AI Usage

In accordance with generative AI usage policies, we disclose the use of generative AI tools\. We used generative AI models as the base models for the experiments\. In addition, we used Google’s Gemini 3 Pro and ChatGPT \(i\.e\., GPT\-5\.2\) as writing assistants\. Their use was limited to proofreading, language and clarity enhancement, conciseness, and word choice; they were not used to generate any core scientific content\. The authors bear full responsibility for the paper’s content\.

## Appendix B Dataset Details

### B\.1 Evaluation\-Set Sampling Details

We sampled the evaluation set from the ComRate dataset, applying the following inclusion criteria: \(1\) a valid note status, \(2\) non\-empty note text, \(3\) a valid corresponding post ID, and \(4\) available associated post text\. The primary evaluation set consists of 2,000 notes, sampled proportionally based on creation year, final status, and primary note category\. This approach aims to preserve the real\-world class distribution of the ComRate dataset\. Specifically, this set includes 1,775 NMR, 149 H, and 76 NH notes\.
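The proportional sampling described above can be sketched in Python as follows\. This is a minimal illustration under stated assumptions: the note fields `year`, `status`, and `category` are hypothetical names, and the largest\-remainder rounding used to make the per\-stratum quotas sum exactly to the target size is one reasonable choice rather than the procedure confirmed by the paper\.

```python
import random
from collections import defaultdict


def proportional_allocation(stratum_sizes, n_total):
    """Split n_total across strata in proportion to their sizes.

    Uses largest-remainder rounding so the quotas sum exactly to n_total.
    Assumes n_total does not exceed the total population size.
    """
    population = sum(stratum_sizes.values())
    quotas = {k: n_total * size / population for k, size in stratum_sizes.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    leftover = n_total - sum(alloc.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(notes, key, n_total, seed=0):
    """Draw a proportional stratified sample; `key` maps a note to its stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for note in notes:
        strata[key(note)].append(note)
    alloc = proportional_allocation({k: len(v) for k, v in strata.items()}, n_total)
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, alloc[k]))
    return sample


def stratum_key(note):
    """Hypothetical stratum: creation year, final status, primary category."""
    return (note["year"], note["status"], note["category"])
```

Calling `stratified_sample(notes, stratum_key, 2000)` on a filtered note pool would yield a sample whose joint distribution over the three stratification variables mirrors that of the pool, up to rounding\.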

To enhance robustness, we additionally constructed a balanced dataset consisting of 1,998 notes, wherein each class \(i\.e\., NMR, H, NH\) contains 666 notes\. Within each class of this balanced dataset, notes were drawn proportionally based on creation year and primary note category to reflect diversity in both temporal and thematic distribution\.

In our evaluation process, we employed three non\-overlapping data splits: a training set used to construct persona profiles, a primary evaluation set comprising 2,000 notes, and a balanced robustness evaluation set comprising 1,998 notes\. The training set used for constructing persona profiles consists of historical Community Notes rating data\. Its purpose is to estimate the matrix\-factorized rater space and to derive generalized persona profiles for each clustering level\. Both evaluation sets are entirely independent of the training set used for persona profile construction, and are used only for out\-of\-fold aggregation, model comparison, and robustness evaluation\.

### B\.2 Details for the Dataset Analysis

Additional details are provided for the descriptive analyses reported in Figure [1](https://arxiv.org/html/2606.18268#S2.F1)\. These analyses characterize ComRate from three aspects: temporal growth, rater heterogeneity, and note\-level distributional patterns\.

For the temporal analysis in Figure [1a](https://arxiv.org/html/2606.18268#S2.F1.sf1), we aggregate the dataset by year\. For each year, we compile the number of unique community notes, the number of unique posts associated with these notes, and the total count of rating records\. The statistics for 2026 are calculated using data available up to April 5, 2026, and therefore represent only a partial year’s data\. The analysis shows that the Community Notes program has expanded rapidly, particularly since 2023, when the volume of both notes and ratings grew quickly\.

For the rater\-cluster analysis in Figure [1b](https://arxiv.org/html/2606.18268#S2.F1.sf2), we used contributor representations learned from the biased rank\-one matrix factorization model described in Section [3](https://arxiv.org/html/2606.18268#S3)\. Each observed rating is mapped to a numerical helpfulness value, from which the model estimates contributor\-specific intercepts and latent factors within the contributor\-note rating matrix\. The contributor intercepts reflect the rater’s overall leniency or strictness, while the latent factors capture residual variation in rating behavior after accounting for global and note\-level effects\. We clustered the contributors into 16 rater groups, corresponding to the 16 role agents employed in the MultiCom setting\. For each group, we summarized both matrix\-factorization features and interpretable behavioral statistics\. Specifically, the heatmap includes seven cluster\-level statistics:

- Rater intercept: the average contributor bias term from the biased matrix\-factorization model\.
- Latent factor: the average contributor latent coordinate\.
- Agreement: the average agreement tendency of contributors in the cluster\.
- Mean note score: the average score of notes rated by contributors in the cluster\.
- Helpful share and not\-helpful share: the average proportions of ratings marked helpful and not helpful\.
- Notes authored: the average note\-authoring activity of contributors in the cluster\.

Figure [1b](https://arxiv.org/html/2606.18268#S2.F1.sf2) visualizes these group\-level statistics after feature\-wise standardization\. For each feature $k$, let $x_{c,k}$ denote the raw cluster\-level value of feature $k$ for cluster $c$; we standardize it across the 16 clusters:

$$z_{c,k}=\frac{x_{c,k}-\bar{x}_{k}}{s_{k}},$$

where $\bar{x}_{k}$ and $s_{k}$ are the mean and standard deviation of feature $k$ across clusters\. In the heatmap, each row corresponds to a rater cluster, each column corresponds to one behavioral statistic, and the color indicates the standardized value of that cluster’s statistic\. Red cells indicate higher values, while blue cells indicate lower values\.
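The feature\-wise standardization above can be sketched in a few lines of Python\. The input layout \(a dictionary from cluster id to a dictionary of feature values\) and the feature name used in the usage note are hypothetical; the text does not state whether the sample or the population standard deviation is used, so the population form is assumed here\.

```python
from statistics import mean, pstdev


def standardize_features(cluster_stats):
    """Z-score each feature across clusters.

    cluster_stats: {cluster_id: {feature_name: raw_value}}.
    Returns the same layout with standardized values. A feature with zero
    spread is mapped to 0.0 for every cluster (a convention assumed here).
    """
    features = sorted({k for row in cluster_stats.values() for k in row})
    z = {c: {} for c in cluster_stats}
    for k in features:
        column = [cluster_stats[c][k] for c in cluster_stats]
        center = mean(column)
        scale = pstdev(column)  # population standard deviation (assumption)
        for c in cluster_stats:
            z[c][k] = (cluster_stats[c][k] - center) / scale if scale > 0 else 0.0
    return z
```

For example, `standardize_features({0: {"helpful_share": 0.2}, 1: {"helpful_share": 0.6}})` maps the two clusters to equal and opposite values, which is what makes columns with different units comparable in a single heatmap\.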

The results indicate that raters are heterogeneous\. Some groups are more inclined to rate notes as helpful, while others are stricter\. Groups also differ in rating consistency and in the number of ratings\. These observed differences provide an empirical basis for MultiCom’s role\-guided design: the simulated agents represent distinct rating behaviors\.
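To make the rater intercept and latent factor more concrete, the following is a minimal Python sketch of a biased rank\-one matrix factorization fitted by stochastic gradient descent\. The parameterization $\hat{r}_{un}=\mu+b_u+b_n+f_u f_n$, the symbols, and all hyperparameters are assumptions made for illustration; the model actually used is the one defined in Section 3\.

```python
import random


def fit_rank_one_mf(ratings, n_raters, n_notes,
                    epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit r ~ mu + b_u[u] + b_n[n] + f_u[u] * f_n[n] by SGD.

    ratings: list of (rater_index, note_index, value) with values such as
    0.0, 0.5, 1.0. All hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    b_u = [0.0] * n_raters   # rater intercepts (leniency or strictness)
    b_n = [0.0] * n_notes    # note intercepts
    f_u = [rng.uniform(-0.1, 0.1) for _ in range(n_raters)]  # rater factors
    f_n = [rng.uniform(-0.1, 0.1) for _ in range(n_notes)]   # note factors
    for _ in range(epochs):
        for u, n, r in ratings:
            err = r - (mu + b_u[u] + b_n[n] + f_u[u] * f_n[n])
            b_u[u] += lr * (err - reg * b_u[u])
            b_n[n] += lr * (err - reg * b_n[n])
            fu, fn = f_u[u], f_n[n]  # use pre-update values for both factors
            f_u[u] += lr * (err * fn - reg * fu)
            f_n[n] += lr * (err * fu - reg * fn)
    return mu, b_u, b_n, f_u, f_n
```

In such a model, a rater who marks most notes helpful ends up with a positive intercept, while the latent factor absorbs the remaining disagreement pattern; averaging these two quantities within a cluster gives statistics of the kind shown in the first two heatmap columns\.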

For Figure [1c](https://arxiv.org/html/2606.18268#S2.F1.sf3), we computed \(1\) the distribution of note categories and \(2\) the number of notes per post\. First, each note was assigned to primary misleading / not misleading categories based on its metadata fields\. Note that one note could belong to multiple categories\. For each category, we calculated the number of notes it contained as well as its proportion of the total\. Second, we calculated the “note\-to\-post” ratio by grouping notes according to their associated post IDs and counting the number of notes attached to each individual post\.

### B\.3 Additional Dataset Statistics

We provide additional descriptive statistics for the full ComRate note collection\. These analyses are computed over all 2,566,644 notes in ComRate, rather than only the 2,000\-note evaluation subset\. Because the official Community Notes release does not provide complete post text for all notes, the full\-dataset language/script and length statistics are computed from note text and official note metadata\.

![Refer to caption](https://arxiv.org/html/2606.18268v1/x6.png)

Figure 6: Distributions of note language/script and note character length in ComRate\. Panel \(a\) reports the dominant language or script category for each note\. Panel \(b\) reports the character length of note text, with dashed lines indicating the median and 90th percentile\.

Figure [6](https://arxiv.org/html/2606.18268#A2.F6) summarizes the language/script and character\-length distributions of the detected ComRate notes\. ComRate is predominantly composed of English notes, which account for 62\.75% of all notes\. Other common Latin\-script languages include Spanish \(8\.88%\), Portuguese \(5\.67%\), French \(4\.66%\), German \(2\.03%\), Turkish \(1\.24%\), and Polish \(0\.77%\)\. Non\-Latin scripts also appear in significant quantities, including Japanese \(6\.78%\), Chinese \(1\.47%\), and Arabic \(0\.63%\)\.

This distribution indicates that while ComRate is primarily English\-centric, it also encompasses multilingual and cross\-script examples drawn from the global Community Notes ecosystem, which increases the generalizability of our results\.

The distribution of note lengths reveals that Community Notes are generally concise\. The average length of a note is 279\.7 characters, with a median of 262 characters and a 90th percentile of 472 characters\. The 25th and 75th percentiles are 161 and 358 characters, suggesting that the majority of notes provide concise contextual explanations rather than lengthy fact\-checking articles\.

## Appendix C Detailed Experiment Results

### C\.1 Robustness on Unstable Notes

We analyzed robustness on unstable notes in the evaluation sets Chuai et al\. \([2026a](https://arxiv.org/html/2606.18268#bib.bib11)\), with results shown in Table [4](https://arxiv.org/html/2606.18268#A3.T4)\. First, we identify notes with status\-transition signals using the official status\-history summary fields\. A note is treated as a status\-transition case if it has evidence of a previous resolved status that differs from the current status, or if the official status fields disagree across core, expansion, group, locked, current, first resolved, and most recent resolved statuses\. This provides a conservative proxy for volatile notes whose public status may have changed over time\. Second, we split notes by the author\-provided trustworthySources field in the official note metadata\.

Table 4: MultiCom performance on status\-transition and trustworthy\-source subsets\. “Status\-transition” is derived from official status\-history summary fields and should be interpreted as a proxy for volatile note status\.

The status\-transition subset has lower overall accuracy than the full evaluation set, suggesting that notes with unstable or changing official status are more difficult cases\. However, its balanced accuracy remains close to the overall result because this subset contains a larger fraction of resolved Helpful and Not Helpful notes\.

The trustworthy\-source split shows that source metadata changes the difficulty profile rather than producing a simple monotonic effect\. Notes with trustworthySources=1 form the majority of the evaluation set and include more resolved Helpful examples, while notes with trustworthySources=0 are more dominated by Needs More Ratings\. MultiCom obtains higher overall accuracy on the no\-trustworthy\-source subset, largely because most of these examples are unresolved, but this subset contains very few Helpful notes, making resolved\-class recall less stable\.

### C\.2 Representational Similarity Analysis

We quantify structural alignment between agent and human rating behaviors at the cluster level using representational similarity analysis \(RSA\), a standard approach for comparing representational similarities across humans and AIs Sucholutsky et al\. \([2025](https://arxiv.org/html/2606.18268#bib.bib32)\); Wynn et al\. \([2024](https://arxiv.org/html/2606.18268#bib.bib33)\)\. Let $\mathcal{C}=\{0,\dots,15\}$ denote the 16 clusters\. For each pair $(i,j)\in\mathcal{C}^{2}$, we define a shared note set

$$S_{ij}=\mathcal{N}^{H}_{i}\cap\mathcal{N}^{H}_{j}\cap\mathcal{N}^{A}_{i}\cap\mathcal{N}^{A}_{j},$$

where $\mathcal{N}^{H}_{k}$ and $\mathcal{N}^{A}_{k}$ are the notes rated by human cluster $k$ and agent cluster $k$, respectively\. We retain a pair only if $|S_{ij}|\geq 30$ to ensure reliable distance estimates\.

Human note\-level ratings are encoded as $\{0,0.5,1\}$ for not\_helpful, somewhat\_helpful, and helpful\. Agent note\-level ratings use the corresponding scalar predicted helpfulness score\. For each retained pair $(i,j)$, we compute dissimilarity as the absolute difference in cluster\-wise mean score on the shared set:

$$D^{H}_{ij}=\left|\mu^{H}_{i}(S_{ij})-\mu^{H}_{j}(S_{ij})\right|,$$

$$D^{A}_{ij}=\left|\mu^{A}_{i}(S_{ij})-\mu^{A}_{j}(S_{ij})\right|.$$

This produces a human representational dissimilarity matrix \(RDM\) $D^{H}$ and an agent RDM $D^{A}$\. RSA is then computed as the Spearman correlation between the upper\-triangular entries of $D^{H}$ and the permuted $D^{A}$\.

Significance is assessed by a label\-permutation test \(2,000 permutations\) on the agent RDM\. We obtain

$$r_{s}=0.459,\quad p=0.0015,$$

\(Figures [4](https://arxiv.org/html/2606.18268#S5.F4) and [7](https://arxiv.org/html/2606.18268#A3.F7)\), indicating statistically significant structural alignment between agent and human cluster\-level rating structures\.
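The RSA procedure in this subsection can be sketched in plain Python as follows\. This is an illustration under several assumptions that the text does not specify: scores are stored as nested dictionaries, every cluster pair is retained \(so the RDMs are complete\), and the p\-value is one\-sided with the common add\-one correction\. All function names are hypothetical\.

```python
import random
from itertools import combinations


def shared_note_sets(human, agent, min_shared=30):
    """S_ij for each cluster pair; keep pairs with at least min_shared notes."""
    sets = {}
    for i, j in combinations(sorted(human), 2):
        s = set(human[i]) & set(human[j]) & set(agent[i]) & set(agent[j])
        if len(s) >= min_shared:
            sets[(i, j)] = s
    return sets


def rdm(scores, sets):
    """D_ij = |mean_i(S_ij) - mean_j(S_ij)| for every retained pair."""
    out = {}
    for (i, j), notes in sets.items():
        mean_i = sum(scores[i][n] for n in notes) / len(notes)
        mean_j = sum(scores[j][n] for n in notes) / len(notes)
        out[(i, j)] = abs(mean_i - mean_j)
    return out


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def rsa_permutation_test(d_h, d_a, clusters, n_perm=2000, seed=0):
    """Spearman RSA with a label-permutation test on the agent RDM.

    Assumes d_h and d_a contain every pair (i, j) with i < j.
    """
    pairs = sorted(d_h)
    human_vec = [d_h[p] for p in pairs]
    observed = spearman(human_vec, [d_a[p] for p in pairs])
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(clusters)
        rng.shuffle(shuffled)
        relabel = dict(zip(clusters, shuffled))
        permuted = [d_a[tuple(sorted((relabel[i], relabel[j])))] for i, j in pairs]
        if spearman(human_vec, permuted) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Shuffling cluster labels, rather than shuffling individual RDM entries, preserves the dependence among entries that share a cluster, which is why label permutation is the usual null model for RDM comparisons\.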

![Refer to caption](https://arxiv.org/html/2606.18268v1/Figure/rsa_appendix_rdms.png)

Figure 7: Representational dissimilarity matrices for humans and agents\.

We further examined the generalizability of these alignments between humans and agents\. Using the same method, we found significant representational similarities between different models and humans, as shown in Figure [8](https://arxiv.org/html/2606.18268#A3.F8)\. Specifically, Claude has the highest representational alignment correlation \($r=0.714$, $p=.0005$\), while Qwen has the lowest \($r=0.329$, $p=.0135$\)\. Nevertheless, all of these correlations are significant, indicating strong structural alignments between humans and agents\.

![Refer to caption](https://arxiv.org/html/2606.18268v1/Figure/rsa_7models_hungarian_clusterlevel_from_intermediate.png)

Figure 8: The representational alignment between models and humans across different models, where each point corresponds to one upper\-triangular entry of the RDMs\.

This trend also generalizes to predictions on future years \(e\.g\., one to three years ahead\)\. Using the same method, we again found significant representational similarities between models and humans, as shown in Figure [9](https://arxiv.org/html/2606.18268#A3.F9)\. Specifically, these settings show similar representational alignment correlations and similar significance levels \(pred 1 year: $r=0.376$, $p=.0015$; pred 2 years: $r=0.376$, $p=.002$; pred 3 years: $r=0.376$, $p=.0015$\)\. This shows the robustness of our persona simulation method\.

![Refer to caption](https://arxiv.org/html/2606.18268v1/Figure/rsa_aligned_three_panel_horizontal_finalstyle.png)

Figure 9: The representational alignment between models and humans across different temporal predictive settings, where each point corresponds to one upper\-triangular entry of the RDMs\.

### C\.3 Balanced\-set Results

Because the original evaluation set is highly imbalanced, we additionally evaluate MultiCom, two ablated variants, and a single\-agent baseline on a sampled balanced 1,998\-note set, where the ratio of H/NH/NMR is 1:1:1\. Table [5](https://arxiv.org/html/2606.18268#A3.T5) reports performance by misinformation category\. MultiCom consistently outperforms the single\-agent baseline and both ablations, achieving 68\.0% accuracy, 68\.0% balanced accuracy, and 67\.3% Macro\-F1 on average\.

Table 5: Ablation and single\-agent baseline results on the balanced set\. Subscripts denote binomial standard errors for accuracy\.

We further evaluate temporal transfer on the balanced 1,998\-note set\. For Pred Year, models are trained on notes from year $t$ and evaluated on notes from year $t+1$\. For Pred 2Years, models are trained on two consecutive years and evaluated on the following two years\. For Pred 3Years, models are trained on three consecutive years and evaluated on the following three years\. Table [6](https://arxiv.org/html/2606.18268#A3.T6) reports temporal\-transfer accuracy by misinformation category\.

Table 6: Temporal transfer results on the balanced 1,998\-note set\. Subscripts denote binomial standard errors for accuracy\.
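The subscripts in Tables 5 and 6 are binomial standard errors. A minimal sketch of that quantity, assuming the usual formula $\sqrt{p(1-p)/n}$ for an accuracy $p$ measured on $n$ notes, is given below; the function name is hypothetical, and the per\-category sample sizes used in the tables are not stated in this appendix.

```python
import math


def binomial_se(accuracy, n):
    """Standard error of an accuracy estimated from n independent predictions."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Example: the average accuracy of 68.0% on the full 1,998-note balanced set.
se_overall = binomial_se(0.68, 1998)
```

For the full balanced set this gives a standard error of roughly one percentage point. Category\-level errors are larger because each category contains fewer notes.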

## Appendix D Methodology Details

### D\.1 Aggregation Feature Views and OOF Predictors

Table [7](https://arxiv.org/html/2606.18268#A4.T7) summarizes the feature views used by the MultiCom aggregator\. All learned aggregation components are trained in an out\-of\-fold manner: for each evaluation note, intermediate predictions used by the final ensemble are produced by models that were not trained on that note\.

Table 7: Feature views used in the calibrated aggregation process\.

The final hard ensemble combines five complementary OOF predictors: oof\_ensemble\_weighted, oof\_ensemble\_gated, xstyle\_rescue\_gate, blend, and full\_meta\. Their weights are empirically optimized as 1\.0, 0\.75, 0\.75, 2\.0, and 1\.0, respectively\. We additionally apply a promotion rule for conservative NMR predictions: when the rationale/metadata blend and the structured\-summary predictor agree on the same resolved label, and vote\-level diagnostic thresholds are satisfied, the prediction is promoted from NMR to that resolved label\.

For the temporal generalization experiment presented in Table [2](https://arxiv.org/html/2606.18268#S5.T2), we construct year\-based rolling train\-test splits based on the creation year of each note\. This setting differs from the primary 5\-fold out\-of\-fold evaluation because test notes are always temporally later than training notes\. For Pred Year, candidate windows are defined as $t \rightarrow t+1$, where the model is trained and calibrated on notes from year $t$ and evaluated on notes from year $t+1$\. For Pred 2Years, candidate windows are $(2021,2022) \rightarrow (2023,2024)$, $(2022,2023) \rightarrow (2024,2025)$, and $(2023,2024) \rightarrow (2025,2026)$\. For Pred 3Years, the candidate window is $(2021,2022,2023) \rightarrow (2024,2025,2026)$\.

Within each training window, all learned aggregation components are fitted solely on notes from the training years\. The component predictors are calibrated through out\-of\-fold splits internal to the training window, and the fixed hard\-ensemble rules described above are then applied to the test window of future years\. Notes from the future test years are strictly excluded from the fitting of component models, the selection of thresholds, and the application of boosting rules\. If the training years do not contain enough samples from all three status categories to support internal calibration, the corresponding candidate window is skipped\. Finally, predictions from all feasible temporal windows under the same setting are pooled before computing the category\-level accuracies reported in Table [2](https://arxiv.org/html/2606.18268#S5.T2)\.
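The rolling windows and the skip rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the note schema (`year` and `status` keys), the helper names, and the `min_per_class` threshold are hypothetical, since the appendix does not state how many samples are required for internal calibration. The sketch also assumes every status category must reach the threshold.

```python
from collections import Counter

STATUSES = ("H", "NH", "NMR")


def candidate_windows(k, first_year=2021, last_year=2026):
    """Train on k consecutive years, test on the k years that follow."""
    windows = []
    for start in range(first_year, last_year - 2 * k + 2):
        train_years = tuple(range(start, start + k))
        test_years = tuple(range(start + k, start + 2 * k))
        windows.append((train_years, test_years))
    return windows


def is_feasible(notes, train_years, min_per_class):
    """Keep a window only if each status has enough training notes.

    Assumption: min_per_class is a hypothetical threshold, not a value from the paper.
    """
    counts = Counter(n["status"] for n in notes if n["year"] in train_years)
    return all(counts[s] >= min_per_class for s in STATUSES)


def feasible_windows(notes, k, min_per_class=1):
    """Candidate windows that survive the skip rule; predictions are pooled over these."""
    return [w for w in candidate_windows(k) if is_feasible(notes, w[0], min_per_class)]


# Toy notes: 2021 lacks any NH note, so windows whose training years
# contain no NH note are skipped.
notes = [
    {"year": 2021, "status": "H"},
    {"year": 2021, "status": "NMR"},
    {"year": 2022, "status": "H"},
    {"year": 2022, "status": "NH"},
    {"year": 2022, "status": "NMR"},
]
one_year = feasible_windows(notes, k=1)
```

With the dataset spanning 2021 to 2026, `candidate_windows(2)` reproduces the three two\-year windows listed above and `candidate_windows(3)` yields the single three\-year window.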

### D\.2 Auxiliary OOF Predictors and Promotion Rule

The final ensemble uses several out\-of\-fold predictors, each trained without access to the held\-out notes in its fold\. The summary\-metadata predictor takes note\-level summary features as input, including vote shares, vote entropy, vote margin, confidence statistics, agreement rates, helpfulness\-reason rates, not\-helpfulness\-reason rates, and metadata\-derived OOF probabilities\. The structured\-summary predictor uses the same summary\-metadata view, augmented with structured diagnostic features aggregated from agent outputs, including rationale length, reader\-understanding scores, persona\-level vote counts, and cluster\-level vote counts\. The full\-metadata predictor further includes per\-agent label indicators and per\-agent diagnostic scores\. The rationale\-text predictor uses concatenated agent rationales with lightweight text features\. The blend predictor is an OOF probability blend of the summary\-metadata, structured\-summary, full\-metadata, and rationale\-text predictors\.
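As an illustration of the note\-level summary features listed above, the sketch below computes vote shares, vote entropy, and vote margin from a list of agent votes. The exact definitions are not given in this appendix, so this assumes Shannon entropy with the natural logarithm and a margin defined as the top share minus the runner\-up share; `vote_summary` is a hypothetical name.

```python
import math
from collections import Counter

STATUSES = ("H", "NH", "NMR")


def vote_summary(votes):
    """Summarize one note's agent votes into share, entropy, and margin features."""
    counts = Counter(votes)
    total = len(votes)
    shares = {s: counts[s] / total for s in STATUSES}
    # Assumption: Shannon entropy in nats over the three status shares.
    entropy = sum(-p * math.log(p) for p in shares.values() if p > 0)
    # Assumption: margin is the gap between the two largest shares.
    top, runner_up = sorted(shares.values(), reverse=True)[:2]
    return {"shares": shares, "entropy": entropy, "margin": top - runner_up}


# Toy example with 16 agents (illustrative votes only).
summary = vote_summary(["H"] * 8 + ["NH"] * 4 + ["NMR"] * 4)
unanimous = vote_summary(["NMR"] * 16)
```

A unanimous vote has zero entropy and a margin of 1, while an even three\-way split maximizes entropy and has a margin of 0, so the two features capture complementary aspects of agent disagreement.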

For the final hard ensemble, we combine five OOF anchors: oof\_ensemble\_weighted, oof\_ensemble\_gated, xstyle\_rescue\_gate, blend, and full\_meta, with weights 1\.0, 0\.75, 0\.75, 2\.0, and 1\.0, respectively\. After this ensemble prediction, we apply a conservative promotion rule only when the initial prediction is NMR\. Specifically, if the blend predictor and the structured\-summary predictor agree on the same resolved label, and the label is not NMR, we promote the final prediction to that label only when the NMR vote share is at least 0\.56 and the mean reader\-understanding score is at most 29\.79\. These thresholds are applied to vote\-level diagnostics and are used to recover high\-confidence resolved cases while avoiding promotion based on a single auxiliary signal\.
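The ensemble and promotion rule can be sketched as below. The weights and thresholds are taken from the text; everything else is an assumption: the sketch treats the hard ensemble as weighted voting over the anchors' predicted labels, breaks ties by anchor order, and uses hypothetical function and argument names.

```python
from collections import defaultdict

# Anchor weights as reported in the text.
ANCHOR_WEIGHTS = {
    "oof_ensemble_weighted": 1.0,
    "oof_ensemble_gated": 0.75,
    "xstyle_rescue_gate": 0.75,
    "blend": 2.0,
    "full_meta": 1.0,
}
NMR_SHARE_MIN = 0.56              # threshold reported in the text
READER_UNDERSTANDING_MAX = 29.79  # threshold reported in the text


def hard_ensemble(anchor_labels):
    """Weighted vote over anchor labels (assumed combination rule)."""
    scores = defaultdict(float)
    for anchor, label in anchor_labels.items():
        scores[label] += ANCHOR_WEIGHTS[anchor]
    # Ties resolve to the label seen first; this tie-break is an assumption.
    return max(scores, key=scores.get)


def final_label(anchor_labels, structured_summary_label,
                nmr_vote_share, mean_reader_understanding):
    """Apply the ensemble, then the conservative promotion rule for NMR."""
    label = hard_ensemble(anchor_labels)
    if label != "NMR":
        return label
    blend_label = anchor_labels["blend"]
    agree = blend_label == structured_summary_label and blend_label != "NMR"
    thresholds_ok = (nmr_vote_share >= NMR_SHARE_MIN
                     and mean_reader_understanding <= READER_UNDERSTANDING_MAX)
    return blend_label if agree and thresholds_ok else label


# Toy case: four anchors say NMR (total weight 3.5), blend says H (weight 2.0).
anchors = {
    "oof_ensemble_weighted": "NMR",
    "oof_ensemble_gated": "NMR",
    "xstyle_rescue_gate": "NMR",
    "blend": "H",
    "full_meta": "NMR",
}
promoted = final_label(anchors, "H", nmr_vote_share=0.60, mean_reader_understanding=20.0)
kept = final_label(anchors, "H", nmr_vote_share=0.50, mean_reader_understanding=20.0)
```

Requiring agreement between two predictors plus both thresholds means a single auxiliary signal can never overturn an NMR prediction on its own, which matches the stated intent of the rule.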

### D\.3 Fine\-tuned Baseline Implementation Details

We implement the fine\-tuned baseline as a direct three\-class classifier over NH, NMR, and H\. The backbone is Mistral\-7B\-Instruct\-v0\.3, fine\-tuned with LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.18268#bib.bib30)\)\. The fine\-tuning set contains 106,611 ComRate examples with available post text, note text, and official status labels, including 4,211 NH, 94,538 NMR, and 7,862 H examples\. Each input concatenates the post, the community note, and a three\-way classification instruction\.

Similar to Xing et al\. \([2026](https://arxiv.org/html/2606.18268#bib.bib6)\), we pool the final\-token hidden representation from the backbone and feed it into a linear classification head\. Training uses weighted cross\-entropy with square\-root inverse class weights and a weighted random sampler\. We use five\-fold cross\-validation to select the best checkpoints\. In each fold, the held\-out fold is used for testing\.

We empirically determine the hyperparameters as follows: maximum sequence length 512, batch size 2, gradient accumulation 4, learning rate $2\times 10^{-4}$, weight decay 0\.01, warmup ratio 0\.1, 3 epochs, LoRA rank 16, LoRA alpha 32, and LoRA dropout 0\.05\.
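To make the class\-weighting scheme concrete, the sketch below derives square\-root inverse class weights from the class counts reported above and applies them in a single\-example weighted cross\-entropy. The normalization (weights rescaled to average 1) and the function names are assumptions; the appendix does not state how the weights are scaled.

```python
import math

# Class counts reported for the fine-tuning set.
CLASS_COUNTS = {"NH": 4211, "NMR": 94538, "H": 7862}


def sqrt_inverse_weights(counts):
    """Weights proportional to 1 / sqrt(count), rescaled to average 1 (assumed)."""
    raw = {label: 1.0 / math.sqrt(n) for label, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {label: w / mean_raw for label, w in raw.items()}


def weighted_cross_entropy(logits, target, weights):
    """Weighted negative log-likelihood of the target class for one example."""
    max_logit = max(logits.values())
    log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits.values()))
    return -weights[target] * (logits[target] - log_norm)


weights = sqrt_inverse_weights(CLASS_COUNTS)
# Uniform logits give a loss of weight * log(3) for any target class.
loss_nh = weighted_cross_entropy({"NH": 0.0, "NMR": 0.0, "H": 0.0}, "NH", weights)
```

The square root softens the correction: NMR is about 22 times more frequent than NH, but the NH weight is only about 4\.7 times the NMR weight, which avoids over\-amplifying the rare classes.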
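For reference, the reported hyperparameters can be collected into a plain configuration dictionary. The key names below are hypothetical and not tied to any particular training library; the two derived quantities use standard definitions (effective batch size as batch size times accumulation steps, and the LoRA scaling factor as alpha divided by rank).

```python
# Hyperparameters as reported in the text; key names are illustrative.
FINETUNE_CONFIG = {
    "backbone": "Mistral-7B-Instruct-v0.3",
    "max_seq_length": 512,
    "batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "num_epochs": 3,
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "num_folds": 5,
}


def effective_batch_size(config):
    """Examples contributing to each optimizer step on a single device."""
    return config["batch_size"] * config["gradient_accumulation_steps"]


def lora_scaling(config):
    """Scaling applied to the low-rank update in standard LoRA (alpha / rank)."""
    return config["lora_alpha"] / config["lora_rank"]


step_batch = effective_batch_size(FINETUNE_CONFIG)
scaling = lora_scaling(FINETUNE_CONFIG)
```

Under these settings each optimizer step sees 8 examples per device and the LoRA update is scaled by 2\.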

### D\.4 Generalizability Setting Justifications

We evaluated three aspects of generalizability: across models, across time, and across numbers of agents\.

For the cross\-model setting, we tested models from different vendors and regions, which may differ in how well they handle different language environments\. Models from different vendors may also have different response tendencies, or even different personalities, so experiments across models test whether the persona simulation framework generalizes\. We therefore selected GPT\-4o, GPT\-5\.1, Claude Opus 4\.6, Qwen3\-Max, and DeepSeek\-v3\.2\. This experiment is not meant to be exhaustive; many open\-source models remain, which we could test in future work\.

For temporal generalizability, because the dataset spans 2021 to 2026, the longest feasible temporal window uses the first three years of data \(2021\-2023\) to predict the following three years \(2024\-2026\)\. We therefore set Pred 3Years as the longest prediction window and designed three settings: Pred Year, Pred 2Years, and Pred 3Years\. In each setting, we used all data from the past years for training or persona calibration and predicted on all notes from the following years \(1, 2, or 3 years\)\.

Finally, we evaluated scaling behavior by configuring the system with 16, 32, and 48 agents\. We selected this range because scaling from 16 to 48 agents already revealed distinct performance trends: a consistent increase in H/NH precision and overall accuracy, alongside a decline in balanced accuracy\. Crucially, even the 16\-agent configuration substantially outperformed the 1\-agent baseline across all three metrics, establishing 16 agents as a highly effective baseline setting\. Conversely, scaling beyond 48 agents \(e\.g\., to 64 or more\) incurs prohibitive computational and financial costs for single\-note prediction\. We therefore restricted our evaluation to these three configurations to balance performance insights with practical efficiency\.

### D\.5 Fine\-grained Label Predictions

For fine\-grained label prediction, we use the structured outputs already generated by MultiCom, including vote distributions, confidence, helpfulness and quality signals, failure\-mode signals, and disagreement patterns\. We then train an out\-of\-fold multi\-label classifier to predict the official reason labels for each note\. This setup is designed to assess whether the diagnostic dimensions extracted from simulated raters align with the explanatory labels provided by human annotators\. Although MultiCom can generate a broader spectrum of diagnostic signals regarding justification levels during the agent simulation process, four official prediction labels were excluded from the primary evaluation because of their low frequency in the 2,000\-note evaluation set, which makes their precision, recall, and F1 scores unstable\.
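The per\-label evaluation and the low\-frequency exclusion can be sketched as follows. This is an illustration only: the reason\-label names, the `min_count` cutoff, and the function names are hypothetical, because the appendix does not state the frequency threshold used to exclude the four labels.

```python
def label_frequencies(y_true, labels):
    """Count how many notes carry each reason label."""
    return {label: sum(label in note_labels for note_labels in y_true) for label in labels}


def frequent_labels(y_true, labels, min_count):
    """Keep labels that occur at least min_count times (hypothetical cutoff)."""
    freq = label_frequencies(y_true, labels)
    return [label for label in labels if freq[label] >= min_count]


def per_label_prf(y_true, y_pred, labels):
    """Precision, recall, and F1 for each label in a multi-label setting."""
    results = {}
    for label in labels:
        tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
        fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
        fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Toy data: each note has a set of reason labels (names are placeholders).
labels = ["reason_a", "reason_b", "reason_rare"]
y_true = [{"reason_a"}, {"reason_a", "reason_b"}, {"reason_b"}, {"reason_rare"}]
y_pred = [{"reason_a"}, {"reason_a"}, {"reason_a", "reason_b"}, set()]

kept = frequent_labels(y_true, labels, min_count=2)
scores = per_label_prf(y_true, y_pred, kept)
```

With only one positive example, a single missed prediction moves recall for `reason_rare` from 1 to 0, which illustrates why scores for very rare labels are unstable.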
