Can LLM Teams Play What? Where? When?

arXiv cs.CL Papers

Summary

This paper investigates whether team-based interaction improves LLM performance in the quiz game 'What? Where? When?' (ChGK). Using six recent open LLMs on a dataset of 572 questions released in 2025, the authors show that team strategies (Voting, Silent Team, Talkative Team) outperform single models by up to 20 percentage points. The best team reaches 44.23% accuracy and approaches human team performance on questions with available human statistics.

arXiv:2605.30459v1 Announce Type: new


# Can LLM Teams Play What? Where? When?
Source: [https://arxiv.org/html/2605.30459](https://arxiv.org/html/2605.30459)
- Anastasia Kotelnikova, Vyatka State University, Kirov, Russia, kotelnikova\.av@gmail\.com
- Viktor Byzov, Vyatka State University, Kirov, Russia, vbyzov@yandex\.ru
- Maria Dolzhenkova, Vyatka State University, Kirov, Russia, maryd@vyatsu\.ru
- Evgeny Kotelnikov, European University at St\. Petersburg, St\. Petersburg, Russia, kotelnikov\.ev@gmail\.com

###### Abstract

Large language models \(LLMs\) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing\. We investigate whether team\-based interaction improves LLM performance in What? Where? When? \(ChGK\), a quiz game designed to reward collective reasoning\. We introduce three team strategies: Voting, Silent Team \(the captain observes final answers\), and Talkative Team \(the captain observes both answers and rationales\)\. To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025\.

Using six recent large\-scale open models, we show that team\-based strategies outperform single\-model baselines, yielding gains of up to 20 percentage points in accuracy\. The best team achieves 44\.23% accuracy, and approaches human team performance on questions with available human statistics\. Analysis of inter\-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops\. We further examine captain behavior and find no evidence of self\-preference bias; access to peer rationales improves captain judgments\.

Overall, LLM teams function primarily as answer selection and error\-filtering mechanisms rather than generators of novel solutions\. Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi\-agent systems\.

Keywords: Large language model, quiz games, multi\-agent systems, collective intelligence, LLM\-as\-a\-Judge


###### Abstract \(Russian version\)

Large language models \(LLMs\) still face limitations on tasks that require implicit reasoning, cultural knowledge, and coordinated hypothesis testing\. This work investigates whether modeling team interaction improves LLM results in the game "What? Where? When?"\. We consider three strategies for organizing LLMs into a team: "voting", "silent team" \(the captain sees only the team members' answers\), and "talkative team" \(the captain sees both the answers and the reasoning\)\. To minimize data leakage, we evaluate these strategies on a dataset of 572 questions published in 2025\.

Using six recent open LLMs, we show that team strategies outperform single models, providing an accuracy gain of up to 20 percentage points\. The most successful team reaches an accuracy of 44\.23% and approaches the results of human teams on questions for which human answer statistics are available\. Analysis of answer diversity shows that growing disagreement between models is associated with lower accuracy, but the exchange of reasoning mitigates the drop in quality\. We also examine captain behavior and find no self\-preference effect; access to team members' reasoning improves the quality of the captain's decisions\.

Overall, LLM teams act primarily as a mechanism for answer selection and error filtering rather than as a source of fundamentally new solutions\. Our results confirm the importance of interaction between models and the promise of adaptive strategies for multi\-agent systems\.

Keywords: large language models, quizzes, multi\-agent systems, collective intelligence, LLM\-as\-a\-Judge

Can Teams of Large Language Models Play "What? Where? When?"

## 1 Introduction

Even strong language models struggle with indirect cultural, metaphorical, or multi\-step reasoning\. A potential solution is team\-based inference: multiple models answer independently, and a designated captain model aggregates their responses into a final decision\.

Consider the following question from a quiz competition:


"One of Adriano Celentano’s songs consists of meaningless words imitating the sound of English\. In an interview, he stated that the song reflects social fragmentation and mentioned a city\. Which city?"

In a six\-model team simulation, four models proposed New York, one suggested Milan, and only one produced the correct answer, Babylon\. After reviewing all responses, the captain revised its initial prediction and selected Babylon, recognizing the reference to the “Babylonian confusion of tongues\.” This example illustrates how minority insights can become decisive when properly aggregated\.

We evaluate collective reasoning in the context of What? Where? When? \(Russian: Chto? Gde? Kogda?; ChGK\), a Russian team\-based intellectual game characterized by riddle\-like questions requiring indirect inference, linguistic sensitivity, and cultural knowledge \([https://en\.wikipedia\.org/wiki/What%3F\_Where%3F\_When%3F](https://en.wikipedia.org/wiki/What%3F_Where%3F_When%3F)\)\. Unlike standard QA benchmarks focused on factual retrieval or localized reasoning \[[11](https://arxiv.org/html/2605.30459#bib.bib12)\], ChGK questions are designed for collaborative problem solving, encouraging hypothesis comparison and correction of misleading interpretations \[[4](https://arxiv.org/html/2605.30459#bib.bib11)\]\. This makes ChGK a natural benchmark for collective intelligence\.

Single LLMs exhibit well\-documented limitations in complex reasoning tasks, including hallucinations \[[1](https://arxiv.org/html/2605.30459#bib.bib13)\], overconfidence \[[5](https://arxiv.org/html/2605.30459#bib.bib14)\], narrow reasoning trajectories \[[18](https://arxiv.org/html/2605.30459#bib.bib16)\], and limited self\-criticism \[[14](https://arxiv.org/html/2605.30459#bib.bib15)\]\. While methods such as chain\-of\-thought prompting \[[19](https://arxiv.org/html/2605.30459#bib.bib17)\], self\-reflection \[[20](https://arxiv.org/html/2605.30459#bib.bib18)\], and iterative refinement \[[16](https://arxiv.org/html/2605.30459#bib.bib19)\] improve internal reasoning, they remain constrained by the biases of a single model\.

Recent work has therefore shifted toward ensemble \[[2](https://arxiv.org/html/2605.30459#bib.bib7)\] and multi\-agent approaches \[[12](https://arxiv.org/html/2605.30459#bib.bib10)\], where multiple models exchange information and aggregate predictions\. This paradigm aligns naturally with ChGK\-style tasks, which reward diversity of viewpoints and penalize premature convergence on plausible but incorrect answers\. Human ChGK teams provide a useful analogue: they distribute cognitive roles, compare competing hypotheses, and sometimes rely on minority insights supported by key evidence\.

We investigate whether explicit modeling of team interaction improves LLM performance on ChGK questions\. We compare three team strategies: Voting \(majority aggregation\), Silent Team \(captain observes final answers only\), and Talkative Team \(captain observes answers and intermediate reasoning\)\. These configurations disentangle the roles of diversity, explanation, and coordination in collective decision\-making\.

Our contributions are as follows:

- We introduce three team\-based interaction paradigms capturing different levels of information sharing\.
- We construct a new evaluation dataset of 572 ChGK questions from 2025, designed to minimize data leakage\.
- Using six recent large\-scale open models, we show that team\-based strategies consistently outperform single\-model baselines, with gains of up to 20 percentage points in accuracy\.
- We analyze disagreement and communication effects, showing that explanatory sharing is especially beneficial under high uncertainty\.
- We examine captain decision behavior and demonstrate that access to peer rationales improves confidence calibration and reliability\.

## 2 Previous Work

### 2\.1 Ensemble and Multi\-Agent Approaches for LLM\-based Question Answering

Recent work shows that combining multiple LLMs through ensembling or structured multi\-agent interaction can substantially improve question answering performance\.

Bujnowski et al\. \[[2](https://arxiv.org/html/2605.30459#bib.bib7)\] proposed a heterogeneous LLM ensemble with confidence\-aware voting and arbitration, achieving strong results on tabular QA\. In the medical domain, Yang et al\. \[[17](https://arxiv.org/html/2605.30459#bib.bib8)\] demonstrated that question\-adaptive weighting of complementary models outperforms uniform aggregation\. Lu et al\. \[[8](https://arxiv.org/html/2605.30459#bib.bib20)\] showed that diversity\-aware ensembles consistently surpass single models, especially on complex reasoning tasks, while even simple majority voting can be effective in multimodal settings \[[10](https://arxiv.org/html/2605.30459#bib.bib9)\]\.

Beyond static aggregation, interactive multi\-agent frameworks have been introduced\. Pitre et al\. \[[12](https://arxiv.org/html/2605.30459#bib.bib10)\] proposed a debate\-based system in which agents iteratively exchange answers, explanations, and confidence estimates, leading to consistent improvements over single\-agent and standard ensemble baselines\.

Overall, prior work highlights the importance of diversity, adaptive aggregation, and structured interaction for improving the reliability and accuracy of LLM\-based QA systems\.

### 2\.2 Datasets

Large quiz\-style datasets are commonly used to evaluate QA systems and LLMs\. Widely adopted resources include the Jeopardy\! clue dataset \([https://github\.com/jwolle1/jeopardy\_clue\_dataset](https://github.com/jwolle1/jeopardy_clue_dataset)\), TriviaQA \[[6](https://arxiv.org/html/2605.30459#bib.bib1)\], SearchQA \[[3](https://arxiv.org/html/2605.30459#bib.bib2)\], and QANTA \[[13](https://arxiv.org/html/2605.30459#bib.bib3)\], which test knowledge retrieval, evidence\-based reasoning, and incremental inference in quiz\-like formats\.

For Russian\-language quiz games, the largest public resource is Russian Jeopardy\! \[[9](https://arxiv.org/html/2605.30459#bib.bib4)\], derived from db\.chgk\.info\. The curated CheGeKa subset is included in the TAPE benchmark for few\-shot Russian language understanding \[[15](https://arxiv.org/html/2605.30459#bib.bib5)\]\. More recently, Kuznetsova et al\. \[[7](https://arxiv.org/html/2605.30459#bib.bib6)\] released a dataset of 2,600 ChGK questions from the IQ Game platform \(2018–2025\) for evaluating open\-source LLMs\.

A key challenge of large quiz archives is data leakage, as many questions have been publicly available for years and may appear in pretraining corpora, particularly affecting closed\-book evaluation\. To mitigate this issue, we construct a new dataset consisting exclusively of questions collected from the IQ Game platform \([https://iqga\.me](https://iqga.me/)\) in 2025\. Restricting evaluation to recent and previously unused material provides a more reliable estimate of genuine generalization and reduces the risk that model performance is driven by memorization of widely circulated quiz content\.

## 3 Methodology

This section describes the experimental framework used in our study\. We first describe the team strategies, which form the main methodological component of our approach, and then present the dataset, the models, and the evaluation protocol used in the experiments\.

### 3\.1 Team Strategies

We compare three strategies for combining the outputs of six LLMs, one of which acts as the captain\.

#### Voting\.

In the Voting strategy, all six models answer the question independently \(see Appendix [A\.1](https://arxiv.org/html/2605.30459#A1.SS1)\)\. Since answers that mean the same thing may differ in wording, we use Gemini\-2\.5\-flash to normalize and group semantically equivalent responses \(Appendix [A\.2](https://arxiv.org/html/2605.30459#A1.SS2)\)\.

After grouping, the answer supported by the largest number of models is selected\. In the case of a tie, the captain acts only as a tie\-breaker: if the captain’s original answer is among the tied options, it is selected; otherwise, one of the tied answers is chosen at random\. As a result, Voting performance may differ depending on which model is designated as the captain\.
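The voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answers are taken to be already normalized and grouped (the paper delegates that step to Gemini-2.5-flash), and the names `vote`, `grouped_answers`, and the model identifiers are hypothetical rather than taken from the authors' code.

```python
import random
from collections import Counter


def vote(grouped_answers, captain, rng=None):
    """Pick the team answer by majority, using the captain only as a tie-breaker.

    grouped_answers: dict mapping model name -> normalized answer label
                     (assumed to be grouped into equivalence classes already).
    captain: name of the model designated as captain.
    rng: optional random.Random used for the random tie-break.
    """
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    counts = Counter(grouped_answers.values())
    top = max(counts.values())
    tied = sorted(answer for answer, c in counts.items() if c == top)
    if len(tied) == 1:
        return tied[0]
    # Tie: prefer the captain's own answer if it is among the tied options.
    captain_answer = grouped_answers[captain]
    if captain_answer in tied:
        return captain_answer
    # Otherwise choose one of the tied answers at random.
    return rng.choice(tied)


# The Celentano example from the introduction: the majority answer wins
# under plain voting even though it is wrong.
team = {
    "m1": "new york", "m2": "new york", "m3": "new york",
    "m4": "new york", "m5": "milan", "m6": "babylon",
}
print(vote(team, captain="m6"))  # new york
```

Because the captain only matters when the top groups are tied, the same six answers can yield different team answers for different captains, which is why Voting accuracy depends on the captain choice.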

#### Silent Team\.

In this setting, the captain receives the six answer variants and must decide on the final answer \(Appendix[A\.3](https://arxiv.org/html/2605.30459#A1.SS3)\)\. The captain does not know which answer was its own\. The captain may select one of the proposed answers or generate a new one if none appears correct\.

#### Talkative Team\.

The Talkative Team extends the Silent Team setup by also providing the captain with short reasoning statements produced by each model\. Thus, the captain sees both the proposed answers and brief explanations of how they were obtained\. Apart from this additional information, the procedure is identical to the Silent Team strategy \(Appendix[A\.4](https://arxiv.org/html/2605.30459#A1.SS4)\)\.
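The difference between the two captain-based settings is only what the captain is shown. The sketch below makes that concrete; it is not the authors' prompt (the real prompts are in the appendix), and the `Proposal` type, the `build_captain_input` helper, the text layout, and the shuffling used to hide authorship are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Proposal:
    answer: str
    rationale: str  # short reasoning statement produced by the model


def build_captain_input(question, proposals, talkative, seed=0):
    """Assemble the text shown to the captain (hypothetical layout).

    talkative=False -> Silent Team: answers only.
    talkative=True  -> Talkative Team: answers plus short rationales.
    """
    order = list(proposals)
    # Assumption: shuffling is one simple way to keep the captain from
    # knowing which candidate was its own.
    random.Random(seed).shuffle(order)
    lines = [f"Question: {question}", "Candidate answers:"]
    for i, p in enumerate(order, start=1):
        line = f"{i}. {p.answer}"
        if talkative:
            line += f" | reasoning: {p.rationale}"
        lines.append(line)
    return "\n".join(lines)


proposals = [
    Proposal("New York", "the lyrics imitate American English"),
    Proposal("Babylon", "alludes to the confusion of tongues"),
]
print(build_captain_input("Which city?", proposals, talkative=True))
```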

### 3\.2 Dataset

Our dataset consists of 572 ChGK questions collected from the IQ Game platform in 2025\. All experiments were conducted in Russian, which is the original language of the questions\. English translations shown in the paper are provided for readability only\. All questions are text\-only and do not include images or other multimedia content\.

Each question is paired with a canonical answer, a set of accepted alternatives, and an explanatory comment outlining the intended reasoning\. When available, we also include human answer statistics showing how often the question was solved correctly by human teams\. Since this information is missing for 133 questions, analyses involving human statistics are conducted on the remaining 439 questions\.

An example question is presented in Figure[1](https://arxiv.org/html/2605.30459#S3.F1)\.

![Refer to caption](https://arxiv.org/html/2605.30459v1/figures/example1.png)

Figure 1: English translation of an illustrative quiz question from the original Russian dataset\.

### 3\.3 Models

Our team consists of six recent open\-source LLMs accessed via public APIs\. All models are large\-scale Mixture\-of\-Experts systems released in 2025\.

#### Qwen Family\.

- Qwen3\-235B\-A22B \(April 2025\): 235B total parameters \(22B active\), general\-purpose instruction and reasoning model\.
- Qwen3\-235B\-A22B\-Thinking\-2507 \(July 2025\): reasoning\-oriented variant of the same architecture\.

#### DeepSeek Family\.

- DeepSeek\-V3\.2 \(December 2025\): 671B total parameters \(37B active\), general\-purpose model\.
- DeepSeek\-R1\-0528 \(May 2025\): 685B total parameters \(37B active\), optimized for reasoning\.

#### Kimi Family\.

- Kimi\-K2\-Instruct\-0905 \(September 2025\): 1T total parameters \(32B active\), optimized for long\-context instruction following\.
- Kimi\-K2\-Thinking \(November 2025\): reasoning\-focused variant of the same architecture\.

All models are used in their publicly available inference configurations without additional fine\-tuning\. We apply default decoding settings with temperature fixed to zero to ensure deterministic outputs\.

This heterogeneous selection allows us to form a diverse team combining strengths in instruction following, long\-context processing, and reasoning\.

For each question, a model may generate up to five attempts\. These retries are used only to handle occasional formatting errors or generation failures\. If no valid answer is produced after five attempts, the question is marked as unanswered and counted as incorrect\.
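The retry policy can be sketched as a small wrapper. The function name `answer_with_retries` and the `generate`/`parse` callables are hypothetical stand-ins for an API call and an output parser; only the limit of five attempts and the "unanswered counts as incorrect" rule come from the text.

```python
def answer_with_retries(generate, parse, max_attempts=5):
    """Call a model until it returns a parseable answer, up to max_attempts.

    generate: zero-argument callable returning the raw model output
              (hypothetical stand-in for an API call).
    parse: callable returning the extracted answer, or None if the output
           is malformed.
    Returns (answer, attempts_used); answer is None if every attempt failed,
    in which case the question is treated as unanswered and scored incorrect.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            raw = generate()
        except Exception:
            continue  # generation failure: spend one attempt and retry
        answer = parse(raw)
        if answer is not None:
            return answer, attempt
    return None, max_attempts


def parse_answer(raw):
    """Toy parser: expects the hypothetical format 'ANSWER: <text>'."""
    if "ANSWER:" not in raw:
        return None
    return raw.split("ANSWER:", 1)[1].strip()
```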

### 3\.4 Evaluation Protocol

We evaluate answers in two stages: \(1\) automatic string matching and \(2\) LLM\-based verification for unresolved cases\.

#### Stage 1: Automatic Matching\.

All answers are first preprocessed: we convert text to lowercase, remove diacritics, and apply lemmatization when specified in the dataset guidelines\. If the question allows free word order, tokens are reordered before comparison\.

The processed answer is then compared against the canonical answer and the set of accepted alternatives\. If a match is found, the answer is marked as correct and no further checks are performed\.
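A minimal sketch of Stage 1, assuming whitespace tokenization and Unicode decomposition for diacritic removal. Lemmatization is deliberately left out because the paper does not name a lemmatizer, and the helper names `normalize` and `stage1_match` are illustrative only.

```python
import unicodedata


def normalize(text, free_word_order=False):
    """Lowercase, strip diacritics, collapse whitespace, optionally sort tokens."""
    text = text.lower()
    # NFD splits characters into base letter + combining marks, which are then
    # dropped. Note this also maps Cyrillic 'й' -> 'и' and 'ё' -> 'е'; since the
    # same function is applied to both sides of the comparison, matching stays
    # consistent.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = stripped.split()
    if free_word_order:
        tokens = sorted(tokens)  # order-insensitive comparison
    return " ".join(tokens)


def stage1_match(answer, canonical, alternatives=(), free_word_order=False):
    """Return True if the answer matches the canonical answer or an accepted alternative."""
    candidate = normalize(answer, free_word_order)
    references = [canonical, *alternatives]
    return any(candidate == normalize(ref, free_word_order) for ref in references)


print(stage1_match("Café  Babylon", "cafe babylon"))  # True
```

Answers that fail this check are not marked wrong; they are passed on to the judge model in Stage 2.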

#### Stage 2: LLM\-as\-a\-Judge\.

Answers not resolved at the first stage are evaluated by a judge model\. We use Gemini\-2\.5\-flash, following \[[7](https://arxiv.org/html/2605.30459#bib.bib6)\], where it showed strong performance in answer verification at relatively low cost\.

For each case, the judge assigns a binary label \(correct or incorrect\) based on the reference answers and their accepted variants\. The evaluation prompt is provided in Appendix[A\.5](https://arxiv.org/html/2605.30459#A1.SS5)\.

#### Metric\.

Performance is measured using accuracy, defined as the proportion of correct answers among all evaluated instances\.

## 4 Results

Table [1](https://arxiv.org/html/2605.30459#S4.T1) reports accuracy for individual models and team\-based strategies\.

Table 1: Accuracy of individual models and team\-based strategies\. Best results in each column are shown in bold\.

### 4\.1 Single\-Model Performance

The strongest standalone result is achieved by Qwen3\-235B\-A22B\-Thinking \(37\.41%\), followed by DeepSeek\-V3\.2 \(35\.14%\)\.

Within the Qwen and Kimi families, reasoning\-oriented variants outperform their instruction\-focused counterparts\. For example, Qwen3\-235B\-A22B\-Thinking exceeds Qwen3\-235B\-A22B by 4\.02 p\.p\., and Kimi\-K2\-Thinking surpasses Kimi\-K2\-Instruct by 11\.01 p\.p\.

This pattern does not hold for DeepSeek: DeepSeek\-V3\.2 outperforms the reasoning\-focused DeepSeek\-R1 \(35\.14% vs\. 29\.20%\)\. This suggests that overall scale, training setup, and release maturity may be as important as explicit reasoning specialization\.

### 4\.2 Impact of Team\-Based Aggregation

All team strategies substantially outperform single models, with gains ranging from roughly 8 to 20 percentage points\.

The largest improvements occur for weaker baselines\. For instance, Kimi\-K2\-Instruct improves from 19\.76% individually to 41\.08% under Voting\. These results show that aggregation effectively compensates for individual model limitations\.

To estimate the upper bound of team performance, we also compute the skyline accuracy, i\.e\., the proportion of questions for which at least one model in the six\-model pool produced a correct answer\. Out of 572 questions, at least one model answered correctly in 341 cases, corresponding to 59\.6% accuracy\. This result indicates a substantial gap between the current aggregation strategies and the maximum performance achievable through perfect answer selection\.
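As a concrete illustration, the skyline can be computed from a per-model correctness table. The toy table and the `skyline_accuracy` helper below are assumptions for illustration; only the counts 341 and 572 and the 44.23% best-team accuracy come from the paper.

```python
def skyline_accuracy(correct_by_model):
    """Fraction of questions answered correctly by at least one model.

    correct_by_model: dict mapping model name -> list of booleans,
                      one flag per question, all lists the same length.
    """
    columns = list(correct_by_model.values())
    total = len(columns[0])
    solved = sum(any(flags) for flags in zip(*columns))
    return solved / total


# Toy data (not from the paper): 4 questions, 3 models.
toy = {
    "model_a": [True, False, False, False],
    "model_b": [False, True, False, False],
    "model_c": [True, True, False, False],
}
print(skyline_accuracy(toy))  # 0.5

# Reported figures: 341 of 572 questions solved by at least one model.
reported = 341 / 572
print(f"{reported:.1%}")  # 59.6%

# Gap between the skyline and the best team (44.23%), in percentage points.
print(round(100 * reported - 44.23, 1))  # 15.4
```

Under these numbers, perfect answer selection would add roughly 15 percentage points over the best observed team.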

### 4\.3 Comparison of Team Strategies

Voting and captain\-based approaches perform similarly overall, and no strategy consistently dominates\.

The Talkative Team achieves the best results for both Qwen models, with Qwen3\-235B\-A22B\-Thinking reaching the highest overall accuracy \(44\.23%\)\. For Kimi\-K2\-Thinking, Silent Team achieves 42\.13%, but Voting performs slightly better, reaching 43\.36% and yielding the strongest result for this model and among all Voting\-based teams\.

Overall, simple voting remains competitive, while captain\-based strategies can offer additional gains in certain configurations\.

### 4\.4 Model Families and Reasoning Orientation

Qwen and DeepSeek models show relatively stable performance across settings, whereas the Kimi family exhibits greater variability between instruction and reasoning variants\.

Reasoning\-oriented models provide stronger single\-model baselines, particularly in the Qwen and Kimi families\. However, this advantage becomes smaller under team aggregation, as weaker instruction\-focused models benefit substantially from collaboration\.

In the DeepSeek family, the general\-purpose DeepSeek\-V3\.2 consistently outperforms the reasoning\-specialized DeepSeek\-R1, reinforcing the idea that scale and training factors may outweigh explicit reasoning design\.

Overall, model family and specialization shape individual performance, but team\-based aggregation reduces these differences by stabilizing weaker models\.

## 5 Discussion

### 5\.1 Diversity of Model Answers and Team Performance

In the following analysis, we focus on team configurations where Qwen3\-235B\-A22B\-Thinking, the strongest individual model and the captain of the highest\-performing team overall, serves as the captain\.

We study how answer diversity affects team performance\. For each question, diversity is defined as the number of distinct answers produced by the six models \($d \in \{1, \dots, 6\}$\)\. Here, $d=1$ means full agreement and $d=6$ complete disagreement\. Figure [2](https://arxiv.org/html/2605.30459#S5.F2) shows team accuracy as a function of $d$\.
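The diversity measure and the per-diversity accuracy breakdown can be sketched as follows. The sketch assumes answers have already been grouped into equivalence classes, so that counting distinct labels is meaningful; the function names and the toy records are hypothetical.

```python
from collections import defaultdict


def diversity(grouped_answers):
    """Number of distinct answers among the six models (d in 1..6)."""
    return len(set(grouped_answers))


def accuracy_by_diversity(records):
    """Team accuracy for each observed diversity value.

    records: iterable of (grouped_answers, team_correct) pairs, where
             grouped_answers is a list of normalized labels and
             team_correct is a bool.
    """
    buckets = defaultdict(list)
    for answers, team_correct in records:
        buckets[diversity(answers)].append(team_correct)
    return {d: sum(flags) / len(flags) for d, flags in sorted(buckets.items())}


# The Celentano example: four "new york", one "milan", one "babylon".
print(diversity(["new york"] * 4 + ["milan", "babylon"]))  # 3
```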

![Refer to caption](https://arxiv.org/html/2605.30459v1/figures/diversity_all.png)

Figure 2: Team accuracy as a function of answer diversity $d$\. Shaded bars indicate the number of questions \($n$\) corresponding to each value of $d$\.

Across all strategies, accuracy decreases sharply as diversity increases\. When models agree \($d=1$\), performance exceeds 80%\. Under maximal disagreement \($d=6$\), it falls below 25%\.

The Talkative Team behaves differently in high\-disagreement cases\. For $d=5$ and $d=6$, it consistently outperforms Voting and Silent Team\. This suggests that short explanations help the captain interpret conflicting answers and recover useful signals even without consensus \(see Figure [3](https://arxiv.org/html/2605.30459#S5.F3)\)\.

![Refer to caption](https://arxiv.org/html/2605.30459v1/figures/example2.png)

Figure 3: Example of a high\-diversity question \($d=6$\) with a correct Talkative Team answer\.

However, at moderate disagreement \($d=3$\), Talkative Team slightly underperforms the other methods\. One possible explanation is that partially conflicting explanations introduce additional noise and complicate decision\-making\.

A natural question is whether diversity simply reflects question difficulty\. Harder questions might lead to more disagreement, while easier ones may produce consensus\. To test this, we analyze the 439 questions with available human statistics\.

Across these questions, human teams achieve an average per\-question solve rate of 49\.83%, while the best LLM configuration \(Talkative Team\) achieves 47\.61% accuracy\.

The Spearman correlation between diversity $d$ and human success rate is weak \($\rho = -0.0914$\) and falls just short of significance at the conventional 0\.05 level \($p = 0.055$\)\. This suggests that disagreement is not merely a proxy for overall difficulty\.
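For readers unfamiliar with the statistic, Spearman's $\rho$ is the Pearson correlation of the ranks, with tied values sharing their average rank. The pure-Python sketch below shows the computation on made-up inputs; it does not reproduce the paper's data and does not compute a p-value (a library routine such as `scipy.stats.spearmanr` returns both the coefficient and the p-value).

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: diversity values vs. human solve rates for five questions.
d_values = [1, 2, 3, 5, 6]
human_rates = [0.9, 0.7, 0.6, 0.4, 0.1]
print(round(spearman(d_values, human_rates), 6))  # -1.0
```

Diversity takes only six values across hundreds of questions, so ties are pervasive, which is why the average-rank treatment matters here.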

To explore this further, we divide the 439 questions into Easy, Medium, and Hard groups based on human performance \(Figure[4](https://arxiv.org/html/2605.30459#S5.F4)\)\. In all three groups, higher diversity remains associated with lower accuracy\. Thus, disagreement reflects instance\-level uncertainty more than coarse difficulty categories\.

![Refer to caption](https://arxiv.org/html/2605.30459v1/figures/diversity_3.png)

Figure 4: Team and human accuracy as a function of answer diversity across different difficulty levels\.

Importantly, for $d \geq 5$, Talkative Team consistently outperforms the other strategies across all difficulty levels\. The benefit of explanatory communication therefore appears robust and most pronounced under strong inter\-model conflict\.

Overall, inter\-model disagreement serves as a useful signal for coordination\. Simple aggregation works well under low disagreement, whereas communication\-intensive strategies are most beneficial under high uncertainty\. This suggests that future systems could adapt their coordination strategy dynamically based on observed diversity\.
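One way such an adaptive policy might look is a simple switch on the observed diversity. This is a speculative sketch of the future direction, not something evaluated in the paper: the function name and the threshold of 5 are assumptions, the latter motivated only by the observation that Talkative Team does best for $d \geq 5$.

```python
def choose_strategy(d, talkative_threshold=5):
    """Pick a coordination strategy from the observed answer diversity d.

    d: number of distinct answers among six models (1..6).
    talkative_threshold: hypothetical cut-off; at or above it the team
        switches from cheap voting to rationale sharing.
    """
    if not 1 <= d <= 6:
        raise ValueError("diversity must be between 1 and 6")
    if d >= talkative_threshold:
        return "talkative_team"  # high disagreement: share rationales
    return "voting"  # low disagreement: simple aggregation suffices


print([choose_strategy(d) for d in range(1, 7)])
```

A practical benefit of this kind of switch is cost: rationales only need to be passed to the captain on the minority of questions where the models strongly disagree.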

### 5\.2 Self\-Preference Bias

We examine whether the captain systematically favors its own initial answer and whether this behavior leads to errors\.

In Talkative Team, the captain selects its own answer in 41\.08% of cases; in Silent Team, it does so in 47\.55% of cases \(Table [2](https://arxiv.org/html/2605.30459#S5.T2)\)\. In both settings, self\-selection is associated with higher accuracy\. In Talkative Team, accuracy is 62\.13% when the captain keeps its own answer, compared to 31\.75% when it switches\. A similar pattern holds in Silent Team \(51\.47% vs\. 34\.33%\)\.
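These conditional figures can be combined into an overall accuracy by weighting each case by how often it occurs. The short check below does this arithmetic; the helper name is hypothetical. The Talkative Team value reproduces the reported 44.23%, while the Silent Team value is only what the quoted percentages imply, not a figure quoted in the text.

```python
def overall_accuracy(self_rate, acc_self, acc_switch):
    """Weighted average of conditional accuracies (all accuracies in percent).

    self_rate: fraction of questions where the captain keeps its own answer.
    acc_self: accuracy when the captain keeps its own answer.
    acc_switch: accuracy when the captain picks another model's answer.
    """
    return self_rate * acc_self + (1 - self_rate) * acc_switch


talkative = overall_accuracy(0.4108, 62.13, 31.75)
silent = overall_accuracy(0.4755, 51.47, 34.33)
print(f"{talkative:.2f}")  # 44.23
print(f"{silent:.2f}")     # 42.48
```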

Table 2: Self\-preference behavior and conditional accuracy of the captain in Silent and Talkative teams\. Self\-choice denotes the proportion of cases in which the captain selected its own initial answer\.

Self\-selection largely overlaps with majority agreement: in over 80% of such cases, the captain’s initial answer matches the most frequent answer in the group\. This suggests that apparent self\-preference often reflects alignment with the majority rather than disregard for alternative views\.

Although the captain is allowed to propose a new answer \(see Appendix[A\.4](https://arxiv.org/html/2605.30459#A1.SS4)\), this never occurred in practice\. All final decisions correspond to one of the initial model responses\.

The two strategies differ in how self\-selection is used\. In Talkative Team, the captain relies on its own answer less often, but these decisions are more accurate\. This indicates that access to peer rationales leads to more selective and better\-calibrated confidence\.

Overall, self\-preference does not appear to be a systematic source of error\. Instead, it reflects calibrated reliance on one’s initial judgment, especially when supported by group agreement\. Communication reduces the frequency of self\-selection while increasing its reliability\.

## 6 Conclusion

We examined whether teams of large language models can approximate aspects of human collective intelligence in the What? Where? When? game\. Using several interaction strategies and a newly constructed dataset of recent questions, we studied how diversity, communication, and coordination affect performance\.

Across all model families, team\-based aggregation consistently outperformed single\-model baselines\. In every setting, the team achieved higher accuracy than the captain model alone, indicating that collective decision\-making provides systematic benefits beyond the strength of any individual model\.

At the same time, we observe clear limitations\. Although captains were allowed to generate new answers, they never proposed solutions outside the initial set of model responses\. Final decisions were always selections among existing candidates\. This suggests that current LLM teams primarily act as selection and error\-filtering mechanisms rather than systems capable of genuine collective creativity\.

Our results also show that coordination matters\. Explanatory communication is especially helpful under high disagreement, while simple voting remains competitive when disagreement is low\. This points to the potential of adaptive systems that adjust their interaction strategy based on observed uncertainty\.

Several directions remain open\. Future work could explore dynamic role assignment, multi\-round deliberation, explicit hypothesis generation, and more complex coordination protocols such as hierarchical or multi\-stage aggregation strategies\. Extending evaluation to other domains with indirect reasoning and ambiguity would further test the generality of these findings\.

Overall, LLM teams demonstrate meaningful forms of collective intelligence, but their behavior remains largely limited to aggregating individual outputs\. Enabling deeper collaborative reasoning remains an important challenge for future multi\-agent language systems\.

## References

- \[1\] \(2025\) Large language models hallucination: a comprehensive survey\. Computing Research Repository arXiv:2510\.06265\. External Links: [Link](https://arxiv.org/abs/2510.06265)\.
- \[2\] P\. Bujnowski, T\. Dryjanski, et al\. \(2025\-07\) Samsung research Poland at SemEval\-2025 task 8: LLM ensemble methods for QA over tabular data\. In Proceedings of the 19th International Workshop on Semantic Evaluation \(SemEval\-2025\), S\. Rosenthal, A\. Rosá, D\. Ghosh, and M\. Zampieri \(Eds\.\), Vienna, Austria, pp\. 1223–1232\. External Links: ISBN 979\-8\-89176\-273\-2\.
- \[3\] M\. Dunn, L\. Sagun, et al\. \(2017\) SearchQA: a new Q&A dataset augmented with context from a search engine\. Computing Research Repository arXiv:1704\.05179\.
- \[4\] E\. J\. Foster, K\. J\. Friedlander, and P\. A\. Fine \(2025\) Mastermind and expert mind: a qualitative study of elite quizzers\. Journal of Expertise 8\(1\), pp\. 38–71\. External Links: [Link](https://www.journalofexpertise.org/articles/volume8_issue1/JoE_8_1_Foster_etal.html)\.
- \[5\] J\. Guan, Q\. Chen, L\. Qin, D\. Peng, J\. Liu, L\. Huo, J\. Xie, and W\. Che \(2025\) Beware of reasoning overconfidence: pitfalls in the reasoning process for multi\-solution tasks\. Computing Research Repository arXiv:2512\.01725\. External Links: [Link](https://arxiv.org/abs/2512.01725)\.
- \[6\] M\. Joshi, E\. Choi, D\. Weld, and L\. Zettlemoyer \(2017\-07\) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), R\. Barzilay and M\. Kan \(Eds\.\), Vancouver, Canada, pp\. 1601–1611\. External Links: [Document](https://dx.doi.org/10.18653/v1/P17-1147)\.
- \[7\] A\. V\. Kuznetsova, V\. A\. Byzov, I\. V\. Aslanov, and E\. V\. Kotelnikov \(2025\) Do open large language models know what, where, and when? A case study with quiz\-style questions\. Supercomputing Frontiers and Innovations 12\(3\), pp\. 90–107\. External Links: [Link](https://superfri.susu.ru/index.php/superfri/article/view/647), [Document](https://dx.doi.org/10.14529/jsfi250307)\.
- \[8\] D\. Lu, J\. Zhang, C\. Yuan, J\. Shao, and X\. Li \(2025\) The law of multi\-model collaboration: scaling limits of model ensembling for large language models\. Computing Research Repository arXiv:2512\.23340\. External Links: [Link](https://arxiv.org/abs/2512.23340)\.
- \[9\] E\. Mikhalkova and A\. A\. Khlyupin \(2022\-06\) Russian jeopardy\! data set for question\-answering systems\. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, N\. Calzolari, F\. Béchet, P\. Blache, K\. Choukri, C\. Cieri, T\. Declerck, S\. Goggi, H\. Isahara, B\. Maegaard, J\. Mariani, H\. Mazo, J\. Odijk, and S\. Piperidis \(Eds\.\), Marseille, France, pp\. 508–514\.
- \[10\] T\. Nguyen\-Mau, N\. N\. Truc, N\. Hoang, M\. Tran, and H\. Nguyen \(2025\) Enhancing visual question answering with pre\-trained vision\-language models: an ensemble approach at the lava challenge 2024\. In Computer Vision – ACCV 2024 Workshops, M\. Cho, I\. Laptev, D\. Tran, A\. Yao, and H\. Zha \(Eds\.\), Singapore, pp\. 281–292\. External Links: [Document](https://dx.doi.org/10.1007/978-981-96-2641-0%5F19), ISBN 978\-981\-96\-2641\-0\.
- \[11\] S\. Ni, G\. Chen, S\. Li, X\. Chen, S\. Li, B\. Wang, Q\. Wang, X\. Wang, Y\. Zhang, L\. Fan, C\. Li, R\. Xu, L\. Sun, and M\. Yang \(2025\) A survey on large language model benchmarks\. Computing Research Repository arXiv:2508\.15361\. External Links: [Link](https://arxiv.org/abs/2508.15361)\.
- \[12\] P\. Pitre, N\. Ramakrishnan, and X\. Wang \(2025\-07\) CONSENSAGENT: towards efficient and effective consensus in multi\-agent LLM interactions through sycophancy mitigation\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 22112–22133\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1141), ISBN 979\-8\-89176\-256\-5\.
- \[13\]P\. Rodriguez, S\. Feng, M\. Iyyer, H\. He, and J\. L\. Boyd\-Graber\(2021\)Quizbowl: the case for incremental question answering\.Computing Research RepositoryarXiv:1904\.04792\(english\)\.Cited by:[§2\.2](https://arxiv.org/html/2605.30459#S2.SS2.p1.1)\.
- \[14\]K\. Stechly, K\. Valmeekam, and S\. Kambhampati\(2025\)On the self\-verification limitations of large language models on reasoning and planning tasks\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=4O0v4s3IzY)Cited by:[§1](https://arxiv.org/html/2605.30459#S1.p7.1)\.
- \[15\]E\. Taktasheva, T\. Shavrina,et al\.\(2022\-12\)TAPE: assessing few\-shot Russian language understanding\.InFindings of the Association for Computational Linguistics: EMNLP 2022,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 2472–2497\(english\)\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.183)Cited by:[§2\.2](https://arxiv.org/html/2605.30459#S2.SS2.p2.1)\.
- \[16\]E\. Xue, K\. Chen, Z\. Huang, Y\. Ji, and H\. Wang\(2025\)IMPROVE: iterative model pipeline refinement and optimization leveraging llm experts\.Computing Research RepositoryarXiv:2502\.18530\.External Links:2502\.18530,[Link](https://arxiv.org/abs/2502.18530)Cited by:[§1](https://arxiv.org/html/2605.30459#S1.p7.1)\.
- \[17\]H\. Yang, M\. Li,et al\.\(2025\-07\-14\)Large language model synergy for ensemble learning in medical question answering: design and evaluation study\.Journal of Medical Internet Research27,pp\. e70080\(english\)\.External Links:ISSN 1438\-8871,[Document](https://dx.doi.org/10.2196/70080)Cited by:[§2\.1](https://arxiv.org/html/2605.30459#S2.SS1.p2.1)\.
- \[18\]T\. Zheng, Y\. Chen, C\. Li, C\. Li, Q\. Zong, H\. Shi, B\. Xu, Y\. Song, G\. Wong, and S\. See\(2025\)The curse of cot: on the limitations of chain\-of\-thought in in\-context learning\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=7SIrvcYNYj)Cited by:[§1](https://arxiv.org/html/2605.30459#S1.p7.1)\.
- \[19\]D\. Zhu, X\. Wei, G\. Zhao, W\. Wu, H\. Zou, J\. Ran, X\. Wang, L\. Sun, X\. Zhang, and S\. Li\(2025\)Chain\-of\-thought matters: improving long\-context language models with reasoning path supervision\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 3197–3211\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.170/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.170),ISBN 979\-8\-89176\-335\-7Cited by:[§1](https://arxiv.org/html/2605.30459#S1.p7.1)\.
- \[20\]X\. Zhu, J\. Jiang, M\. M\. Khalili, and Z\. Zhu\(2025\)From emergence to control: probing and modulating self\-reflection in language models\.Computing Research RepositoryarXiv:2506\.12217\.External Links:[Link](https://arxiv.org/abs/2506.12217)Cited by:[§1](https://arxiv.org/html/2605.30459#S1.p7.1)\.

## Appendix A Prompts

This appendix provides the full text of the prompts used in our experiments. Placeholders such as {question} and {answer1} denote fields populated at runtime. All experiments were conducted in Russian, the original language of the dataset. For readability, the prompts are shown here in English translation.

### A.1 Prompt for generating individual answers

    You are participating in an intellectual game. Please briefly reason about the following question and give the correct answer.
    Question: {question}.
    Output the reasoning and answer in JSON format:
    {
      "reasoning": "your reasoning here",
      "answer": "your answer here"
    }
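As an illustration of how such a template might be used, the following minimal sketch fills the `{question}` placeholder and parses a reply. The helper names `build_prompt` and `parse_reply` are hypothetical and not from the paper, and the tolerant JSON extraction is an assumption about how replies could be handled, not the authors' procedure.

```python
import json

# English translation of the A.1 template; the experiments themselves used Russian.
INDIVIDUAL_PROMPT = (
    "You are participating in an intellectual game. "
    "Please briefly reason about the following question and give the correct answer.\n"
    "Question: {question}.\n"
    "Output the reasoning and answer in JSON format:\n"
    "{\n"
    '  "reasoning": "your reasoning here",\n'
    '  "answer": "your answer here"\n'
    "}"
)


def build_prompt(question: str) -> str:
    """Fill the placeholder (hypothetical helper).

    str.replace is used instead of str.format because the template
    contains literal JSON braces that str.format would try to interpret.
    """
    return INDIVIDUAL_PROMPT.replace("{question}", question)


def parse_reply(reply: str) -> dict:
    """Extract the outermost JSON object from a model reply (assumed handling)."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    return {
        "reasoning": str(data.get("reasoning", "")),
        "answer": str(data.get("answer", "")),
    }
```

Slicing from the first `{` to the last `}` tolerates models that wrap the JSON in extra text, at the cost of failing on replies that contain several separate objects.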

### A.2 Gemini prompt for answer normalization

    You are an expert evaluating responses in an intellectual game. Your task is to assess the diversity of answers to the questions.
    For each question, you are given:
      "id" - the question identifier,
      "question" - the question itself,
      "correct_answer" - the correct answer,
      "variations" - acceptable answer variations that are counted as correct,
      "answer1", ... "answer6" - the answers you need to evaluate,
    Return only the JSON without additional comments, where each answer is a dictionary with the keys:
      "id" - the question identifier,
      "answer variants" - a dictionary where the keys are strings like "variant 1", "variant 2", etc., and the values are lists of answers that are not semantically different from each other or are simply identical.
    Do not evaluate the correctness of the answers, but their semantic diversity as responses to the given question.
    All six answers must be placed into one of the lists. Lists may contain repetitions.
    Check that you have evaluated all {answer_count} answers from the list.
    List of questions and answers: {answers}
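The grouped output requested by this prompt lends itself to a simple majority vote. The sketch below is one possible reading, not the paper's implementation: `majority_vote` is a hypothetical helper, and both the tie-breaking rule (the earliest listed group wins) and the use of the group count as a diversity measure are assumptions.

```python
import json


def majority_vote(normalizer_json: str, expected_count: int = 6) -> tuple[str, int]:
    """Return (voted answer, number of distinct variants) for one question.

    Assumes the normalizer returned a dictionary with an "answer variants"
    key, as requested in the prompt above.
    """
    record = json.loads(normalizer_json)
    groups = [g for g in record["answer variants"].values() if g]
    placed = sum(len(g) for g in groups)
    if placed != expected_count:
        # The prompt requires every answer to land in exactly one list.
        raise ValueError(f"expected {expected_count} answers, got {placed}")
    # max() keeps the first maximal element, so ties go to the earliest
    # variant in the returned dictionary (an assumed tie-breaking rule).
    largest = max(groups, key=len)
    return largest[0], len(groups)


example = json.dumps({
    "id": 1,
    "answer variants": {
        "variant 1": ["Paris", "paris", "Paris, France"],
        "variant 2": ["Lyon", "Lyon"],
        "variant 3": ["Nice"],
    },
})
print(majority_vote(example))  # ('Paris', 3)
```

The second element of the returned pair counts semantically distinct answers, which is one way to quantify the inter-model disagreement discussed in the paper.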

### A.3 Silent team prompt

    You are participating in an intellectual game. You are given the following question and six answer variants. It is unknown whether any of the variants is correct.
      Question: {question}.
      Answer 1: {answer1}.
      ...
      Answer 6: {answer6}.
    Briefly review the answer variants.
    If the correct answer is listed, select it.
    If none of the suggested variants is correct, suggest your own correct answer.
    Provide the result in JSON format:
    {
      "reasoning": "your brief thoughts are here",
      "answer": "your answer here"
    }

### A.4 Talkative team prompt

    You are participating in an intellectual game. You are given the following question and six answer variants. For each answer, the player's reasoning is provided. It is unknown whether any of the variants is correct.
      Question: {question}.
      Answer 1:
      {
        "reasoning": {reasoning1},
        "answer": {answer1}
      },
      ...
      Answer 6:
      {
        "reasoning": {reasoning6},
        "answer": {answer6}
      }
    Briefly review the answer variants and their explanations. If the correct answer is listed, select it. If none of the suggested variants is correct, suggest your own correct answer.
    Provide the result in JSON format:
    {
      "reasoning": "your brief thoughts are here",
      "answer": "your answer is here"
    }
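The only difference between the silent and talkative captain prompts is whether each team member's reasoning accompanies the answer. A minimal sketch of rendering the answer block in both modes follows. The function `render_answers` is hypothetical, and two details are assumptions: that the elided entries follow the same pattern as the first and last, and that values are JSON-quoted with `json.dumps` (the template itself does not say how they are inserted).

```python
import json


def render_answers(members: list[dict], talkative: bool) -> str:
    """Render the 'Answer i' block of a captain prompt (hypothetical helper).

    Each member is a dict with "reasoning" and "answer" keys, as produced
    by the individual-answer prompt.
    """
    parts = []
    for i, member in enumerate(members, start=1):
        if talkative:
            # Talkative team: the captain sees the rationale and the answer.
            body = json.dumps(
                {"reasoning": member["reasoning"], "answer": member["answer"]},
                ensure_ascii=False,  # keep Cyrillic text readable
                indent=2,
            )
            parts.append(f"Answer {i}:\n{body}")
        else:
            # Silent team: the captain sees final answers only.
            parts.append(f"Answer {i}: {member['answer']}.")
    return "\n".join(parts)


team = [
    {"reasoning": "The clue points to a river.", "answer": "Volga"},
    {"reasoning": "The date suggests a lake.", "answer": "Baikal"},
]
print(render_answers(team, talkative=False))
print(render_answers(team, talkative=True))
```

Keeping both modes in one function makes it easy to run the same team output through either captain setting and compare the resulting decisions.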

### A.5 LLM-as-a-Judge prompt

    You are an expert, a researcher of answers in an intellectual game. Your task is to evaluate the answers to the questions. Each answer is evaluated separately, independently of other answers.
    For each question, you are given:
      "id" - question ID,
      "question" - the question,
      "answer" - the answer you have to evaluate,
      "correct_answer" - the correct answer,
      "variations" - acceptable answer variants that are also counted as correct.
    Return without further comments only a JSON list of ratings, where each rating is a dictionary with keys:
      "id" - question ID,
      "is_correct" - Boolean value indicating whether the answer is correct.
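Since the judge returns a JSON list of per-question ratings, accuracy reduces to the share of entries marked correct. The sketch below shows that computation; `accuracy_percent` is a hypothetical helper, and the strict type check on `is_correct` is an assumption about how malformed judge output might be caught.

```python
import json


def accuracy_percent(judge_json: str) -> float:
    """Percentage of answers the judge marked as correct (hypothetical helper)."""
    ratings = json.loads(judge_json)
    if not ratings:
        raise ValueError("judge returned no ratings")
    correct = 0
    for rating in ratings:
        verdict = rating["is_correct"]
        if not isinstance(verdict, bool):
            # Reject strings such as "true" rather than silently coercing them.
            raise TypeError(f"non-Boolean verdict for id {rating.get('id')!r}")
        correct += verdict
    return 100.0 * correct / len(ratings)


judge_output = """[
  {"id": 1, "is_correct": true},
  {"id": 2, "is_correct": false},
  {"id": 3, "is_correct": true}
]"""
print(round(accuracy_percent(judge_output), 2))  # 66.67
```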
