BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
Summary
This paper introduces BehaviorBench, a comprehensive benchmark for evaluating foundation models on behavioral science tasks including behavior prediction, strategic decision-making, subject-trait inference, and behavioral knowledge application. It also presents Be.FM-1.5, a fine-tuned model that achieves strong distributional alignment, highlighting the gap between general-purpose and behaviorally adapted models.
# BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
Source: [https://arxiv.org/html/2606.24162](https://arxiv.org/html/2606.24162)
Jin Huang1, Yutong Xie1, Wanli Song1, Xingjian Zhang1, Walter Yuan2, Matthew O\. Jackson3,4, Qiaozhu Mei1

1University of Michigan; 2MobLab; 3Stanford University; 4Santa Fe Institute

1\{huangjin,yutxie,wanlis,jimmyzxj,qmei\}@umich\.edu; 2\{walter\.yuan\}@moblab\.com; 3,4\{jacksonm\}@stanford\.edu
###### Abstract
Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics\. While these models show promise in individual tasks such as survey response prediction and human\-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations\. We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: \(1\) behavior prediction and simulation, \(2\) strategic decision\-making, \(3\) subject\-trait inference, and \(4\) behavioral knowledge application\. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per\-subject accuracy but also population\-level alignment, an essential requirement for behavioral validity\. Leveraging the tasks in BehaviorBench, we further develop Be\.FM\-1\.5, extending the Be\.FM family of behavioral foundation models fine\-tuned on behavioral data\. Our results reveal a considerable gap: proprietary general\-purpose models excel at individual\-level prediction and knowledge\-intensive tasks, whereas behavioral foundation models, fine\-tuned on behavioral data, achieve substantially stronger distributional alignment\. Notably, Be\.FM\-1\.5 leads on distributional metrics and remains competitive on individual\-level metrics, suggesting that proper behavioral adaptation can close the gap\. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be\.FM\-1\.5’s potential for a broad range of behavioral science studies\. BehaviorBench and the Be\.FM\-1\.5 models can be accessed via [https://umich\-foreseer\.github\.io/behaviorbench/](https://umich-foreseer.github.io/behaviorbench/)\.
## 1 Introduction

Figure 1: Aggregated evaluation results of foundation models on BehaviorBench\. *Left:* individual\- vs\. distributional\-level performance in mean win rate \(%\) across 20 models\. *Right:* Snapshots of the top 12 models on the BehaviorBench leaderboards, with models ranked independently at the individual and distributional evaluation levels by mean win rate \(Mean WR, %\)\. Foundation models in the leaderboards are colored by model type, with Be\.FM\-1\.5 models highlighted in bold\.

Foundation models have transformed many fields\. Trained on vast and diverse data, they can adapt across tasks and are increasingly used not only for prediction, but also for reasoning, simulation, and scientific discovery \(Bommasani et al\., [2021](https://arxiv.org/html/2606.24162#bib.bib1); Park et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib2); Wang et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib3)\)\. One field with particularly transformative potential is *behavioral science*: the science of understanding, predicting, and shaping human behaviors\. Drawing on multiple disciplines, including psychology, sociology, anthropology, and economics, behavioral science provides foundations for understanding decision\-making dynamics at individual, group, and societal levels \(Gintis, [2014](https://arxiv.org/html/2606.24162#bib.bib4)\)\. Its insights inform policymaking, healthcare, education, and business strategies, underscoring the need for more advanced AI\-driven approaches to behavioral analysis and intervention \(Jackson et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib5)\)\.
Recent work has started to apply general\-purpose foundation models, including commercial and open\-source large language models, to behavioral science \(Demszky et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib6); Meng, [2024](https://arxiv.org/html/2606.24162#bib.bib7); Bail, [2024](https://arxiv.org/html/2606.24162#bib.bib8)\)\. These models show promise in improving survey research \(Argyle et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib9); Grossmann et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib10)\), enabling automated content analysis \(Wu et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib11); Yang and Menczer, [2023](https://arxiv.org/html/2606.24162#bib.bib12); Gilardi et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib13)\), supporting agent\-based modeling \(Park et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib2), [2024](https://arxiv.org/html/2606.24162#bib.bib14)\), and predicting human\-subject experiment results \(Hewitt et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib15); Lippert et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib16)\)\. In parallel, researchers are also developing behavioral\-science\-specific foundation models by fine\-tuning LLMs on behavior\-related data\. These behavioral foundation models have demonstrated improved performance in simulating human behavior, predicting survey responses, and supporting behavioral research workflows \(Xie et al\., [2025b](https://arxiv.org/html/2606.24162#bib.bib17); Binz et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib18); Kolluri et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib19)\)\.
Despite this growing body of work, there is still no systematic understanding of how well foundation models perform across diverse behavioral science tasks, contexts, and populations\. Existing benchmarks are often limited to a narrow task family, such as survey response prediction \(Suh et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib20); Wang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib21)\)\. As a result, important model capabilities, such as inferring latent traits of human subjects, remain unmeasured\. More critically, most benchmarks treat each human subject as an independent data point and score models using pointwise accuracy \(Wang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib21); Shapira et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib22); Duan et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib23); Huang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib24)\)\. This overlooks the *distributional* nature of human populations: behavioral science often requires models not only to predict individual outcomes, but also to preserve population\- or group\-level diversity and heterogeneity \(Kolluri et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib19); Xie et al\., [2025a](https://arxiv.org/html/2606.24162#bib.bib25)\)\.
To address these limitations, we introduce BehaviorBench, a comprehensive benchmark for systematically evaluating foundation models on behavioral science tasks\. We organize our evaluation around the observation that human behavior is jointly shaped by multiple factors, including the behavioral context, the subject’s traits, and underlying motivations \(Xie et al\., [2025b](https://arxiv.org/html/2606.24162#bib.bib17)\)\. Formally, this can be represented as a conditional probability

$$p(y \mid x, c; \mathbb{K}), \tag{1}$$

where $y$ denotes a behavior, $x$ the characteristics of the human subject, $c$ the contextual variables defining the environment or intervention, and $\mathbb{K}$ represents relevant behavioral knowledge\. A foundation model that excels at behavioral science tasks should support multiple types of inference within this framework\. Accordingly, we evaluate models along four core capabilities: \(1\) behavior prediction and simulation \(forward inference over $y$ to mimic a human subject\), \(2\) strategic decision\-making \(the generation of $y$ to achieve a specified goal\), \(3\) subject\-trait inference \(inverse inference over $x$\), and \(4\) the application of behavioral science knowledge \($\mathbb{K}$\) to research problems\. Built on a diverse collection of datasets, BehaviorBench comprises 12 distinct tasks spanning these four capabilities\. We evaluate models at both the *individual* and the *distributional* levels, so that a model’s ability to align with a population’s behavioral variation is treated as a first\-class objective\.
Motivated by the general capabilities and distributional alignment goals in BehaviorBench, we further develop Be\.FM\-1\.5, extending the Be\.FM \(Xie et al\., [2025b](https://arxiv.org/html/2606.24162#bib.bib17)\) family of foundation models that are specifically designed for behavioral science tasks, as a feasibility test of targeted model development\. Be\.FM\-1\.5 fine\-tunes open\-source LLMs on a substantially broader set of tasks, with explicit coverage of diverse capabilities, behavioral contexts, and human populations\.
Our evaluation reveals that current foundation models exhibit uneven strengths across behavioral science tasks: no single model family dominates the full benchmark\. General\-purpose proprietary LLMs excel at individual\-level prediction and knowledge\-intensive tasks, but they often lag behind on distributional metrics\. In contrast, behavioral foundation models achieve stronger distributional alignment on average \(Figure [1](https://arxiv.org/html/2606.24162#S1.F1)\)\. Among all evaluated models, only Be\.FM\-1\.5 and Gemini 3\.1 Pro show leading performance on both distributional and individual\-level metrics\. Notably, Be\.FM\-1\.5 is fine\-tuned on data that is held out from BehaviorBench\. This suggests that accurate out\-of\-sample individual\-level predictions and faithful distributional alignment can be achieved simultaneously, revealing a desirable opportunity to adapt foundation models to behavioral science\. Overall, our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be\.FM\-1\.5’s potential for a broad range of behavioral science studies\.
## 2 BehaviorBench
This section introduces BehaviorBench\. Section [2\.1](https://arxiv.org/html/2606.24162#S2.SS1) describes the underlying behavioral data for model evaluation\. Section [2\.2](https://arxiv.org/html/2606.24162#S2.SS2) details the tasks that fall into each of the four capability categories\. Section [2\.3](https://arxiv.org/html/2606.24162#S2.SS3) introduces the metrics, and Section [2\.4](https://arxiv.org/html/2606.24162#S2.SS4) compares BehaviorBench with existing benchmarks\. Table [1](https://arxiv.org/html/2606.24162#S2.T1) summarizes all tasks in BehaviorBench with their evaluation\-set sizes and metrics\.
### 2\.1 Data
Capturing the breadth of human behavior demands data of comparable breadth\. Following Xie et al\. \([2025b](https://arxiv.org/html/2606.24162#bib.bib17)\), BehaviorBench draws on three sources of behavioral data for evaluation\. The complete data collection and processing procedures are described in Appendix [A](https://arxiv.org/html/2606.24162#A1)\.
1. Experimental data\. Experimental records capture behavioral patterns of diverse subjects in controlled environments, offering key insights into individual and group decision\-making\. We consider a collection of experiments from MobLab \([https://www\.moblab\.com/](https://www.moblab.com/), retrieved on May 4, 2026\), which covers a diverse population of human subjects and their play records in multiple economics games \(Mei et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib26)\)\. BehaviorBench uses an evaluation subsample of 12,685 behavioral observations from 6,279 unique subjects\.
2. Survey data\. Survey data provide self\-reported insights into personality traits, attitudes, and preferences\. We use the publicly available Big Five Personality Test dataset \([https://www\.kaggle\.com/datasets/lucasgreenwell/ocean\-five\-factor\-personality\-test\-responses](https://www.kaggle.com/datasets/lucasgreenwell/ocean-five-factor-personality-test-responses), retrieved on May 4, 2026\) as a data source\. This dataset consists of demographic information of subjects \(e\.g\., race, age, gender\), along with their answers to 50 survey questions\. The answers can be aggregated into five scores, representing their personality factors\. BehaviorBench includes an evaluation subsample of 6,000 responses from 1,886 unique subjects\.
3. Literature data\. The behavioral science literature documents existing theories and scientific findings about human behavior\. We collect titles and abstracts of 104 articles from the *American Economic Review* and 140 articles from *Nature Human Behaviour* that were published in 2025\. We then follow Zhang et al\. \([2025](https://arxiv.org/html/2606.24162#bib.bib27)\) to decompose each title–abstract pair into a structured research workflow that reflects how the studies are done: context, key idea, method, outcome, and projected impact\. Such data are included in BehaviorBench to evaluate foundation models’ ability in scientific reasoning and assisting behavioral science research\.
### 2\.2 Evaluation Tasks

| Capability | Task | \# Eval\. Data | Eval\. Metric \(Ind\.\) | Eval\. Metric \(Dist\.\) |
| --- | --- | --- | --- | --- |
| Behavior prediction and simulation | Single\-round game behavior simulation \(Game Behav\. Sim\.\) | 200/game | – | W |
| | Multi\-round game behavior prediction \(Multi\-Round Pred\.\) | 500/game | MAE | W |
| | Single\-round game behavior prediction given observations from other games \(Across\-Ctx Pred\.\) | 750/game | MAE | W |
| | Survey response prediction given demographics \(Demo\. To Resp\.\) | 1,000 | Acc | W |
| | Sequential survey response prediction \(Seq\. Resp\. Pred\.\) | 1,000 | Acc | W |
| | Masked survey response prediction \(Masked Resp\. Pred\.\) | 1,000 | Acc | W |
| Strategic decision\-making | Strategic game play | 1,000 | Win \(%\) | – |
| Subject\-trait inference | Personality score prediction given demographics \(Demo\. To Pers\.\) | 1,000 | MAE | W |
| | Personality score prediction given scores from other dimensions \(Across\-Dim Pers\. Pred\.\) | 1,000 | MAE | W |
| | Age prediction given personality scores \(Pers\. To Demo\.\) | 1,000 | MAE | W |
| Knowledge application | Scientific workflow prediction | 1,084 | BLEURT | – |
| | Economics contest problem solving | 124 | Acc | – |

Table 1: Evaluation tasks in BehaviorBench, by core capabilities\. \# Eval\. Data: evaluation set sizes\. Metrics: individual level \(Ind\.\) and distributional level \(Dist\., “W” stands for Wasserstein distance\)\. Per\-task data sources and processing procedures are detailed in Appendix [A](https://arxiv.org/html/2606.24162#A1)\.

By formulating the generation of human behavior as $p(y \mid x, c, \mathbb{K})$, BehaviorBench evaluates foundation models with four categories of behavioral tasks: \(1\) behavior prediction and simulation \(forward inference over $y$\), \(2\) strategic decision\-making \(the generation of $y$ to achieve a goal\), \(3\) subject\-trait inference \(inverse inference over $x$\), and \(4\) the application of behavioral science knowledge \($\mathbb{K}$\) for scientific reasoning\. Formally, let $\mathcal{Y}$ denote the space of human behaviors \(e\.g\., actions in experiments and survey responses\), $\mathcal{X}$ the space of human subject traits \(e\.g\., demographics and personalities\), and $\mathcal{C}$ the space of contextual variables \(e\.g\., experiment setup and intervention\)\.
#### 2\.2\.1 Behavior Prediction and Simulation
Predicting and simulating human behavior is at the core of behavior modeling \(Wang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib21); Binz et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib18); Kolluri et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib19)\)\. Behavior prediction arises naturally at two levels: the individual level and the distributional level\.

At the individual level, the task is to predict the behavior of a particular subject in a given context\. Formally, this can be represented as a projection

$$\mathcal{X} \times \mathcal{Y} \times \mathcal{C} \to \mathcal{Y}. \tag{2}$$

Concretely, through a foundation model, an input $(x, y', c) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{C}$ is mapped to an output $y \in \mathcal{Y}$, where $x$ denotes the characteristics of a particular subject, $y'$ is partial information about the subject’s behavior, which can be empty, $c$ is the context \(e\.g\., experiment instructions and setup\), and $y$ is the behavior to be predicted \(e\.g\., a subject’s choice in the next term\)\.

At the distributional level, the target is to predict a behavior distribution across a population\. To obtain a behavior distribution $Y$, the foundation model can run individual\-level predictions over a population $(X, Y') = \{(x_i \in \mathcal{X}, y'_i \in \mathcal{Y})\}_{i=1}^{N}$ in a context $c$\. We then evaluate whether $Y$ aligns with the empirical human behavior distribution\.
BehaviorBench constructs the evaluation of this capability from two data sources: experimental data from economic games and survey data from Big Five survey responses\.

Game play simulation and prediction\. Using the MobLab game\-play records described in Section [2\.1](https://arxiv.org/html/2606.24162#S2.SS1), we define three simulation/prediction tasks\. \(1\) *Single\-round game behavior simulation\.* Given the game instructions as $c$, a model predicts the distribution of the population’s actions in a single \(first\) round, $\hat{p}(y \mid c)$\. The purpose is to measure whether a model distributionally aligns with the human population when playing a game\. \(2\) *Multi\-round game behavior prediction\.* Given the game instructions $c$ and a subject’s behavior in prior rounds as $y'$, a model predicts their action in the next round $y$\. Different from \(1\), here we condition on a subject’s decision trajectory\. \(3\) *Across\-context prediction\.* Given a subject’s behavior in other games as $y'$, a model predicts the first\-round behavior in a target game as $y$\.

Survey response prediction\. Using the Big Five responses described in Section [2\.1](https://arxiv.org/html/2606.24162#S2.SS1), we define three prediction tasks\. \(1\) *Survey response prediction given demographics\.* Given a subject’s demographics as $x$ and a single survey item as $c$, a model predicts the subject’s response $y$\. This task is consistent with the setting in prior works \(Santurkar et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib28); Wang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib21); Kolluri et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib19)\)\. \(2\) *Sequential survey response prediction\.* Given a subject’s responses to prior items in survey order as $y'$, a model sequentially predicts the response $y$ to the next item $c$\. \(3\) *Masked survey response prediction\.* Given all 49 other responses as $y'$, a model predicts the response $y$ to one masked survey item $c$\. This task reflects the long\-standing challenge of missing response prediction in survey research \(Brick and Kalton, [1996](https://arxiv.org/html/2606.24162#bib.bib29); Little and Rubin, [2019](https://arxiv.org/html/2606.24162#bib.bib30)\)\.
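
To make the difference between the sequential and masked settings concrete, the sketch below builds task instances from one subject’s 50\-item response vector\. The helper names `masked_instance` and `sequential_instances` are hypothetical, the responses are made up, and starting the sequential task at the second item is an assumption; the benchmark’s actual instance construction is described in its appendix and may differ\.

```python
from typing import Dict, List, Tuple


def masked_instance(responses: List[int], masked_idx: int) -> Tuple[Dict[int, int], int]:
    """Masked setting: every other item is observed (y'), the masked item is the target (y)."""
    observed = {i: r for i, r in enumerate(responses) if i != masked_idx}
    return observed, responses[masked_idx]


def sequential_instances(responses: List[int]) -> List[Tuple[List[int], int, int]]:
    """Sequential setting: (responses to prior items in survey order, next item index, target).

    Assumption: prediction starts at the second item, so y' is never empty.
    """
    return [(responses[:t], t, responses[t]) for t in range(1, len(responses))]


# Made-up answers on a 1-5 scale for a 50-item survey.
responses = [(i % 5) + 1 for i in range(50)]

observed, target = masked_instance(responses, masked_idx=10)
print(len(observed), target)      # 49 observed items, one target response

instances = sequential_instances(responses)
print(len(instances))             # one instance per item after the first
```

In the masked setting the model sees items both before and after the target, while in the sequential setting it only sees the prefix, so the amount of conditioning information grows with the item position\.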
#### 2\.2\.2 Strategic Decision\-making
Section [2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1) concerns predicting and simulating a subject’s behavior\. Here, we examine a distinct task: making strategic decisions to achieve a specified goal\. In this setting, actions are taken on one’s own behalf, rather than mimicking a particular human subject\. This capability is especially important as AI and LLM agents increasingly make autonomous decisions and interact with humans \(Mei et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib26); Jackson et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib5)\)\. Formally, this task requires a model to make the projection

$$\mathcal{C} \times \mathcal{Y} \to \mathcal{Y}, \tag{3}$$

where an input $(c, y_{\text{others}}) \in \mathcal{C} \times \mathcal{Y}$ is mapped to an action $y \in \mathcal{Y}$\. Here $c$ is the context \(e\.g\., experiment instructions and setup\), $y_{\text{others}}$ is other subjects’ actions, and $y$ is the decision generated to achieve the specified goal\. Prior work has used economic games to measure this capability \(Shapira et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib22); Guo et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib31); Huang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib24); Duan et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib23)\)\. However, most prior works have LLMs play against other LLMs; few have LLMs play against human players \(Shapira et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib22)\)\.

BehaviorBench uses human play records of the Beauty Contest game from the MobLab data source to construct an evaluation\. In Beauty Contest, each player picks a number in $[0, 100]$ for every round \($y$\)\. The winner of that round is whoever comes closest to $\frac{2}{3}$ of the group average \(the average of $y$ and $y_{\text{others}}$\)\. At the end of each round, each player is shown the group average and gets feedback on whether they won that round\. This game involves opponent modeling and extensive reasoning \(Nagel, [1995](https://arxiv.org/html/2606.24162#bib.bib32); Ho et al\., [1998](https://arxiv.org/html/2606.24162#bib.bib33)\)\.
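
The round rule described above can be sketched in a few lines\. This is a minimal illustration, not the MobLab implementation: the function name `beauty_contest_round`, the player labels, and the guesses are hypothetical, and returning every tied player as a winner is an assumption, since the text does not say how ties are resolved\.

```python
from typing import Dict, List, Tuple


def beauty_contest_round(guesses: Dict[str, float]) -> Tuple[float, List[str]]:
    """Resolve one Beauty Contest round.

    The target is 2/3 of the group average over all guesses in [0, 100];
    the winners are the players whose guess is closest to that target.
    Assumption: tied players all count as winners.
    """
    if any(not 0 <= g <= 100 for g in guesses.values()):
        raise ValueError("guesses must lie in [0, 100]")
    average = sum(guesses.values()) / len(guesses)
    target = 2 * average / 3
    best_gap = min(abs(g - target) for g in guesses.values())
    winners = [p for p, g in guesses.items() if abs(g - target) == best_gap]
    return target, winners


# Hypothetical round: the group average is 30, so the target is 20.
target, winners = beauty_contest_round({"A": 10, "B": 25, "C": 40, "D": 45})
print(target, winners)
```

Because every guess moves the group average, a good decision depends on anticipating what the other players will choose, which is why the game is used to probe opponent modeling\.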
#### 2\.2\.3 Subject\-trait Inference
In Sections [2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1) and [2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2), we introduced tasks for predicting, simulating, and generating behavior \($y$\) under various settings\. Here, we introduce the inverse tasks for inferring subjects’ traits \($x$\)\. Characterizing individual subjects and populations is a core goal of behavioral science, supporting market segmentation \(Yang and Zhai, [2022](https://arxiv.org/html/2606.24162#bib.bib34)\), user profiling \(Rychalska et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib35)\), personalized interventions \(Demszky et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib6)\), and policy design \(Bleidorn et al\., [2019](https://arxiv.org/html/2606.24162#bib.bib36)\)\. Similar to Section [2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1), the task of inferring a subject’s traits can also be formulated at both individual and distributional levels\. At the individual level, the task is to infer a latent characteristic of a particular subject from their observed behavior\. Formally, a foundation model is tasked with the projection

$$\mathcal{Y} \times \mathcal{X} \times \mathcal{C} \to \mathcal{X}. \tag{4}$$

Concretely, an input $(y, x', c) \in \mathcal{Y} \times \mathcal{X} \times \mathcal{C}$ is mapped to an output $x \in \mathcal{X}$, where $y$ is the subject’s behavior under context $c$, $x'$ is partial information about the subject, and $x$ is a latent characteristic that can be reflected by $y$\.

At the distributional level, the target is to infer a distribution of latent traits across a specific population\. To obtain a distribution of inferred characteristics $X$, we run individual\-level inferences over a population $(Y, X') = \{(y_i \in \mathcal{Y}, x'_i \in \mathcal{X})\}_{i=1}^{N}$ in a context $c$\. We then evaluate whether $X$ aligns with the empirical human trait distribution\.

The Big Five dataset provides various traits about subjects, including demographic profile and personality scores along five dimensions\. We introduce three tasks in BehaviorBench\. \(1\) *Personality score prediction given demographics\.* Given a subject’s demographics as $x'$ \(e\.g\., age, gender, and native language\), a model predicts their score on a Big Five personality dimension as $x$\. \(2\) *Across\-dimension personality score prediction\.* Given a subject’s scores on four personality dimensions as $x'$, aggregated from the subject’s survey responses $y$, a model predicts their score on the remaining dimension as $x$\. \(3\) *Age prediction given personality scores\.* Given a subject’s five\-dimensional personality scores as $x'$, aggregated from survey responses, a model predicts the subject’s age as $x$\.
#### 2\.2\.4 The Application of Behavioral Science Knowledge
Beyond prediction and simulation, a foundation model should be able to represent and apply behavioral science knowledge to support scientific reasoning and problem solving\. These tasks take the form of applying knowledge $\mathbb{K}$ to solve an input question $q$ and produce an answer $a$\.

Research workflow reasoning\. Following Zhang et al\. \([2025](https://arxiv.org/html/2606.24162#bib.bib27)\), we decompose a paper’s title and abstract into five research\-workflow aspects: context, key idea, method, outcome, and projected impact\. We then define five generation tasks: key idea generation, method recommendation, outcome prediction, impact prediction, and title prediction\. This group of tasks reflects models’ ability to assist behavioral research workflows\.

Complex problem solving\. To test whether the model can apply domain knowledge to unseen problems, we use multiple\-choice questions from International Economics Olympiad \(IEO\) contests from 2018 to 2025 \(IEO official website: [https://ieo\-official\.org/](https://ieo-official.org/), retrieved May 4, 2026\)\. Model choices are compared with official solutions\.
### 2\.3 Metrics
For individual\-level evaluation, we use:
- Mean absolute error \(MAE\) for the prediction of numeric quantities, including game actions, trait scores, and age\. We choose MAE because it preserves the ordinal structure of the response, so that predictions numerically closer to the true value are scored as better\.
- Accuracy for categorical predictions, including predicting survey responses and answering IEO questions, which aligns with prior works \(Suh et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib20); Kolluri et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib19)\)\.
- Win rate for strategic decision\-making \(i\.e\., the Beauty Contest game\), defined as the fraction of sessions in which the model’s guess lies closer to $\frac{2}{3}$ of the group average than any human player’s guess in that session\.
- BLEURT \(Sellam et al\., [2020](https://arxiv.org/html/2606.24162#bib.bib37)\) for open\-ended generation in the research workflow task, following Zhang et al\. \([2025](https://arxiv.org/html/2606.24162#bib.bib27)\)\.
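
The first three individual\-level metrics above are simple enough to sketch directly\. The code below is an illustration under stated assumptions rather than the benchmark’s scoring script: all function names and example values are hypothetical, the group average is assumed to include the model’s own guess, and a tie with a human player is counted as a loss\. BLEURT is omitted because it requires a learned scoring model\.

```python
from typing import List, Sequence, Tuple


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute gap between numeric predictions and observed values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of categorical predictions that match exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def model_wins(model_guess: float, human_guesses: Sequence[float]) -> bool:
    """True if the model is strictly closer to 2/3 of the group average than every human.

    Assumptions: the group average includes the model's guess; ties count as losses.
    """
    group = [model_guess, *human_guesses]
    target = 2 * (sum(group) / len(group)) / 3
    model_gap = abs(model_guess - target)
    return all(model_gap < abs(h - target) for h in human_guesses)


def win_rate(sessions: List[Tuple[float, Sequence[float]]]) -> float:
    """Fraction of sessions won by the model."""
    return sum(model_wins(m, h) for m, h in sessions) / len(sessions)


# Hypothetical values for illustration only.
print(mean_absolute_error([3, 5, 2], [4, 5, 0]))              # numeric task
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))                   # categorical task
print(win_rate([(22, [50, 33, 15]), (60, [10, 20, 30])]))     # Beauty Contest sessions
```

Note that MAE gives partial credit for a near miss while accuracy does not, which is the ordinal property the first bullet refers to\.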
For distribution\-level evaluation, we use the Wasserstein distance \(denoted $W$\) to compare the predicted behavior distribution against the observed human distribution\. Following Suh et al\. \([2025](https://arxiv.org/html/2606.24162#bib.bib20)\) and Kolluri et al\. \([2025](https://arxiv.org/html/2606.24162#bib.bib19)\), we choose $W$ because it captures both the shape and the mean of two distributions\. More details on the evaluation metrics are in Appendix [B](https://arxiv.org/html/2606.24162#A2)\.
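
The sketch below illustrates why individual\-level and distributional metrics can rank predictions differently\. It assumes the first\-order Wasserstein distance between one\-dimensional empirical samples, computed as the area between the two empirical CDFs; the exact variant and any normalization used in BehaviorBench are specified in Appendix B and may differ\. The function names and the toy data are hypothetical\.

```python
from bisect import bisect_right
from typing import Sequence


def wasserstein_1d(u: Sequence[float], v: Sequence[float]) -> float:
    """First-order Wasserstein distance between two 1-D empirical samples.

    Computed as the integral of |F_u(t) - F_v(t)| over t, where F_u and F_v
    are the empirical CDFs. The samples may have different sizes.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    grid = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


humans = [1, 2, 3, 4, 5]        # observed responses of five subjects
constant = [3, 3, 3, 3, 3]      # always predict the "average" subject
shuffled = [5, 4, 3, 2, 1]      # right population, wrong individuals

for name, pred in [("constant", constant), ("shuffled", shuffled)]:
    print(name, mean_absolute_error(humans, pred), wasserstein_1d(humans, pred))
```

In this toy case the constant predictor has the lower MAE \(about 1\.2 versus 2\.4\) but the larger Wasserstein distance \(about 1\.2 versus 0\), because it collapses the population onto a single value\. This is the kind of mismatch that evaluating only at the individual level would miss\.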
### 2\.4 Comparison with Existing Benchmarks
Table [2](https://arxiv.org/html/2606.24162#S2.T2) compares BehaviorBench with existing benchmarks for behavioral\-science\-related tasks\. Existing benchmarks primarily target narrow task families: SocioBench \(Wang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib21)\) and SubPOP \(Suh et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib20)\) evaluate behavior prediction from demographics or context, while GLEE \(Shapira et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib22)\), GTBench \(Duan et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib23)\), and GAMA\-Bench \(Huang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib24)\) evaluate strategic decision\-making in games\. These benchmarks focus on behavior $y$; none of them evaluates the inference of subject traits $x$ or the application of behavioral knowledge $\mathbb{K}$\. In contrast, BehaviorBench is broader in scope, spanning four capability categories of behavioral science tasks\. Importantly, BehaviorBench supports evaluation at both the individual and distributional levels, enabling analysis of how well models preserve the diversity and heterogeneity of human populations rather than merely aligning with typical or average subjects\. We also identify several publicly available datasets that could potentially be developed into benchmarks, including OpinionQA \(Santurkar et al\., [2023](https://arxiv.org/html/2606.24162#bib.bib28)\), Psych\-101 \(Binz et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib18)\), and SocSci210 \(Kolluri et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib19)\)\. However, these datasets likewise do not fully cover the four capability categories\. More details on these benchmarks and datasets are provided in Section [6\.2](https://arxiv.org/html/2606.24162#S6.SS2)\.

| Benchmark | Predict & Sim\. Behav\. \($y$\) | Strategic Decision\-Making \($y$\) | Infer Subj\. Traits \($x$\) | Apply Knowledge \($\mathbb{K}$\) | Eval\. Level: Indiv\. | Eval\. Level: Distrib\. |
| --- | --- | --- | --- | --- | --- | --- |
| BehaviorBench \(Ours\) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SocioBench \(Wang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib21)\) | ✓ | | | | ✓ | |
| SubPOP \(Suh et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib20)\) | ✓ | | | | | ✓ |
| GLEE \(Shapira et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib22)\) | | ✓ | | | ✓ | |
| GTBench \(Duan et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib23)\) | | ✓ | | | ✓ | |
| GAMA\-Bench \(Huang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib24)\) | | ✓ | | | ✓ | |

Table 2: Comparison of BehaviorBench with existing behavioral benchmarks\. BehaviorBench is the only one evaluating all four capabilities at both individual and distributional levels\.
## 3 Be\.FM\-1\.5: Towards Advanced Behavioral Foundation Models
BehaviorBench sets a higher bar for behavioral foundation models, covering four capability categories and evaluating performance at both the individual and population levels\. For this reason, we further develop Be\.FM\-1\.5, extending the Be\.FM family \(the original Be\.FM models are denoted Be\.FM\-1\)\. Be\.FM\-1\.5 extends the training recipe by fine\-tuning open\-source LLMs on a substantially broader set of behavioral tasks, with explicit coverage of diverse capabilities, behavioral contexts, and human populations\.
### 3\.1 Backbone Models and Fine\-tuning Schema
Be\.FM\-1\.5 comes in two sizes, each fine\-tuned from a widely used open\-weight backbone: Be\.FM\-1\.5\-4B from Qwen3\-4B \(Team, [2025](https://arxiv.org/html/2606.24162#bib.bib38)\) and Be\.FM\-1\.5\-70B from Llama\-3\.3\-70B \(Team, [2024](https://arxiv.org/html/2606.24162#bib.bib39)\)\. We use parameter\-efficient supervised fine\-tuning, applying Low\-Rank Adaptation \(LoRA\) \(Hu et al\., [2022](https://arxiv.org/html/2606.24162#bib.bib40)\) \(rank 8, $\alpha = 32$\) to all linear layers via the ms\-swift framework \(Zhao et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib41)\)\. Both backbone models are trained for one epoch on 4 $\times$ A100 80GB GPUs with a peak learning rate of $10^{-4}$, sequence length 2048, and effective batch size 128\. The training of Be\.FM\-1\.5\-4B and Be\.FM\-1\.5\-70B takes around 60 and 600 A100 hours, respectively\.
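
To give a sense of what rank 8 with $\alpha = 32$ means in practice, the sketch below implements the standard LoRA update from Hu et al\. \(2022\), $W x + \frac{\alpha}{r} B A x$, in plain Python\. It is not the ms\-swift implementation: the function names are hypothetical, the 4096 $\times$ 4096 layer size is an illustrative assumption rather than a stated dimension of either backbone, and the $\alpha / r$ scaling is the convention of the original LoRA formulation\.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def lora_scaling(rank: int, alpha: float) -> float:
    """Scale applied to the low-rank update in the original LoRA formulation."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_linear(x: Vector, w: Matrix, a: Matrix, b: Matrix, rank: int, alpha: float) -> Vector:
    """Frozen base projection plus scaled low-rank update: W x + (alpha / rank) * B (A x)."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = lora_scaling(rank, alpha)
    return [bi + scale * ui for bi, ui in zip(base, update)]


# Setting reported in the text: rank 8, alpha 32.
print(lora_scaling(rank=8, alpha=32))

# Hypothetical 4096 x 4096 linear layer (illustrative size only).
d = 4096
adapter = lora_trainable_params(d, d, rank=8)
print(adapter, adapter / (d * d))   # adapter size and its share of the full weight matrix
```

Under these assumptions the adapter for such a layer holds 65,536 parameters, well under 1% of the full weight matrix, which is what makes fine\-tuning a 70B backbone feasible on four GPUs\.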
### 3.2 Fine-tuning Tasks and Data
Inspired by BehaviorBench, we formulate fine-tuning tasks as described in Section [2.2](https://arxiv.org/html/2606.24162#S2.SS2) and curate fine-tuning data from the three sources described in Section [2.1](https://arxiv.org/html/2606.24162#S2.SS1). Importantly, although the fine-tuning shares the same task formulation as BehaviorBench, the fine-tuning data do not overlap with the BehaviorBench evaluation data. Data splitting procedures are summarized in Appendix [A.2](https://arxiv.org/html/2606.24162#A1.SS2).
Specifically, we instantiate 11 fine-tuning tasks, organized by the general capabilities they target: behavior prediction and simulation, strategic decision-making, subject-trait inference, and the application of behavioral knowledge. The task mixture is deliberately broad. It exposes Be.FM-1.5 to all components in the framework of Equation [1](https://arxiv.org/html/2606.24162#S1.E1) (i.e., behavior $y$, subject traits $x$, context $c$, and behavioral science knowledge $\mathbb{K}$). This breadth is intended to encourage generalization across tasks, contexts, and populations, rather than specialization in any single task or dataset. Details on fine-tuning data are described in Appendix [A.4](https://arxiv.org/html/2606.24162#A1.SS4).
## 4 Models Evaluated
We benchmark three types of foundation models: (1) open-source general-purpose LLMs, (2) proprietary LLMs, and (3) behavioral foundation models that are specifically trained on behavior-related data.
Open-source LLMs. We include three open-source models, each chosen for a specific comparison goal. Qwen3-4B (Team, [2025](https://arxiv.org/html/2606.24162#bib.bib38)) is one of the strongest small open-weight models. Llama-3.3-70B (Team, [2024](https://arxiv.org/html/2606.24162#bib.bib39)) is a widely used open-weight backbone for downstream fine-tuning. Because Be.FM-1.5 is fine-tuned from these two backbones, including them also lets us assess the effect of supervised fine-tuning. For broader coverage of open-source models, we additionally include DeepSeek V3.2 (DeepSeek-AI, [2025](https://arxiv.org/html/2606.24162#bib.bib42)), a leading 671B-parameter open-source model.
Proprietary LLMs. We include three families of widely used frontier proprietary models. Within each family we include different capability tiers. For Claude, we include Claude Haiku 4.5 (Anthropic, [2025](https://arxiv.org/html/2606.24162#bib.bib43)), Claude Sonnet 4.6 (Anthropic, [2026b](https://arxiv.org/html/2606.24162#bib.bib44)), and Claude Opus 4.6 (Anthropic, [2026a](https://arxiv.org/html/2606.24162#bib.bib45)), spanning weak to strong capability. For GPT, we include two leading reasoning models, GPT-5.4-mini (OpenAI, [2026a](https://arxiv.org/html/2606.24162#bib.bib46)) and GPT-5.4 (OpenAI, [2026b](https://arxiv.org/html/2606.24162#bib.bib47)), together with GPT-4.1 (OpenAI, [2025a](https://arxiv.org/html/2606.24162#bib.bib48)) as the strongest non-reasoning GPT model. For Gemini, we include Gemini 3.1 Flash and Gemini 3.1 Pro (Google, [2026](https://arxiv.org/html/2606.24162#bib.bib49)). We use the default inference hyperparameters for all models, with one exception: for GPT-5.4 and GPT-5.4-mini, we set the `reasoning_effort` parameter to `high` (the default is `none`) to ensure a fair comparison with other frontier models.
Behavioral foundation models. This category includes foundation models fine-tuned on behavior-related data. In the BehaviorBench evaluation, we include three representative model families. Centaur-70B (Binz et al., [2024](https://arxiv.org/html/2606.24162#bib.bib18)) is fine-tuned on a cognitive-psychology corpus, while Socrates (Kolluri et al., [2025](https://arxiv.org/html/2606.24162#bib.bib19)) is fine-tuned on social-science survey data and is evaluated in both SFT and DPO variants. Be.FM models include the original Be.FM-1 (Xie et al., [2025b](https://arxiv.org/html/2606.24162#bib.bib17)) and the extended Be.FM-1.5 models (as described in Section [3](https://arxiv.org/html/2606.24162#S3)).
## 5 Results and Discussion
Capability grouping: Behavior Prediction & Simulation (Game Behav. Sim. through Masked Resp. Pred.); Strategic (Game Play).
| Model | Game Behav. Sim. W↓ | Multi-Round Pred. MAE↓ | Multi-Round Pred. W↓ | Across-Ctx Pred. MAE↓ | Across-Ctx Pred. W↓ | Demo. To Resp. Acc↑ | Demo. To Resp. W↓ | Seq. Resp. Pred. Acc↑ | Seq. Resp. Pred. W↓ | Masked Resp. Pred. Acc↑ | Masked Resp. Pred. W↓ | Game Play Win Rate↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 26.9 | 21.7 | 13.1 | 27.9 | 20.1 | 24.3% | 1.14 | 37.0% | 0.65 | 37.2% | 0.66 | 9% |
| Llama3.3-70B | 24.2 | 27.0 | 15.3 | 26.4 | 14.9 | 28.4% | 1.02 | 37.7% | 0.59 | 43.9% | 0.47 | 6% |
| DeepSeek V3.2 | 16.9 | 20.9 | 10.1 | 27.5 | 12.3 | 27.2% | 0.76 | 36.2% | 0.54 | 40.6% | 0.47 | 14% |
| Claude Haiku 4.5 | 24.5 | 21.0 | 12.5 | 26.0 | 17.3 | 26.0% | 0.93 | 40.2% | 0.52 | 44.7% | 0.43 | 13% |
| Claude Sonnet 4.6 | 24.6 | 18.7 | 10.6 | 25.3 | 19.3 | 28.6% | 0.86 | 41.8% | 0.45 | 46.7% | 0.36 | 17% |
| Claude Opus 4.6 | 26.0 | 19.3 | 11.7 | 25.9 | 20.1 | 29.7% | 0.86 | 41.8% | 0.43 | 45.1% | 0.37 | 15% |
| GPT 5.4 mini | 29.0 | 23.7 | 17.6 | 27.4 | 18.0 | 28.8% | 0.91 | 41.7% | 0.49 | 44.3% | 0.46 | 10% |
| GPT 4.1 | 27.5 | 18.1 | 6.9 | 27.1 | 16.3 | 29.2% | 0.79 | 39.0% | 0.48 | 46.4% | 0.38 | 17% |
| GPT 5.4 | 31.4 | 21.1 | 15.2 | 26.1 | 14.0 | 29.6% | 0.90 | 42.4% | 0.46 | 47.3% | 0.38 | 8% |
| Gemini 3.1 Flash | 31.3 | 25.5 | 18.6 | 29.4 | 17.3 | 30.5% | 0.82 | 40.5% | 0.52 | 46.3% | 0.41 | 14% |
| Gemini 3.1 Pro | 21.3 | 19.4 | 12.5 | 26.7 | 10.3 | 30.5% | 0.75 | 42.5% | 0.46 | 48.5% | 0.36 | 12% |
| Socrates-Llama3-8B-SFT | 16.3 | 39.5 | 21.3 | 37.7 | 14.6 | 24.0% | 0.55 | 23.5% | 0.62 | 24.0% | 0.67 | 2% |
| Socrates-Llama3-8B-DPO | 26.4 | 27.3 | 19.7 | 27.7 | 21.4 | 24.7% | 0.93 | 25.9% | 0.96 | 28.4% | 0.76 | 3% |
| Socrates-Qwen-14B-SFT | 17.9 | 35.9 | 15.9 | 36.7 | 14.7 | 23.8% | 0.69 | 23.8% | 0.71 | 26.2% | 0.69 | 2% |
| Socrates-Qwen-14B-DPO | 20.0 | 20.5 | 14.4 | 24.8 | 17.7 | 23.9% | 1.01 | 35.3% | 0.71 | 37.8% | 0.56 | 8% |
| Centaur-70B | 16.9 | 23.3 | 11.1 | 29.3 | 13.9 | 24.7% | 0.43 | 33.1% | 0.48 | 36.1% | 0.49 | 4% |
| Be.FM-1-8B | 8.8 | 34.1 | 12.4 | 35.4 | 5.4 | 21.6% | 0.60 | 27.1% | 0.59 | 24.4% | 0.62 | 2% |
| Be.FM-1-70B | 8.7 | 32.0 | 8.6 | 34.9 | 7.6 | 22.3% | 0.49 | 31.4% | 0.46 | 34.1% | 0.49 | 2% |
| Be.FM-1.5-4B (Ours) | 7.0 | 18.4 | 5.7 | 27.5 | 5.9 | 28.4% | 0.42 | 39.9% | 0.36 | 44.7% | 0.34 | 48% |
| Be.FM-1.5-70B (Ours) | 8.9 | 18.0 | 3.5 | 27.8 | 6.5 | 28.6% | 0.44 | 40.4% | 0.36 | 44.9% | 0.33 | 50% |
Table 3: Benchmarking results of behavior prediction and simulation, as well as strategic decision-making. Tasks and abbreviations are described in Table [1](https://arxiv.org/html/2606.24162#S2.T1). Bold and underline denote the best and second best performance.
Capability grouping: Subject-trait Inference (Demo. To Pers., Across-Dim Pers. Pred., Pers. To Demo.); Knowledge Application (Sci. Workflow Pred., Econ. Contest).
| Model | Demo. To Pers. MAE↓ | Demo. To Pers. W↓ | Across-Dim Pers. Pred. MAE↓ | Across-Dim Pers. Pred. W↓ | Pers. To Demo. MAE↓ | Pers. To Demo. W↓ | Sci. Workflow Pred. BLEURT↑ | Econ. Contest Acc.↑ |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 7.3 | 5.14 | 7.6 | 5.21 | 11.2 | 10.44 | 0.450 | $73.9\% \pm 1.3$ |
| Llama3.3-70B | 6.9 | 5.18 | 8.3 | 5.19 | 9.3 | 7.67 | 0.430 | $65.8\% \pm 0.6$ |
| DeepSeek V3.2 | 7.3 | 5.12 | 7.5 | 4.61 | 10.7 | 7.57 | 0.430 | $74.4\% \pm 1.6$ |
| Claude Haiku 4.5 | 7.6 | 5.94 | 7.2 | 4.59 | 11.8 | 6.00 | 0.428 | $70.0\% \pm 2.2$ |
| Claude Sonnet 4.6 | 7.6 | 5.83 | 7.2 | 4.37 | 14.5 | 8.72 | 0.472 | $83.1\% \pm 1.3$ |
| Claude Opus 4.6 | 7.0 | 5.18 | 6.7 | 4.49 | 13.3 | 8.47 | 0.477 | $\mathbf{95.6}\% \pm 0.6$ |
| GPT 5.4 mini | 7.2 | 5.53 | 7.4 | 5.02 | 10.9 | 6.13 | 0.454 | $\underline{93.2}\% \pm 1.2$ |
| GPT 4.1 | 6.9 | 4.92 | 7.2 | 3.93 | 10.6 | 4.01 | 0.460 | $78.4\% \pm 0.8$ |
| GPT 5.4 | 7.4 | 5.70 | 7.3 | 4.40 | 10.5 | 4.05 | 0.462 | $\mathbf{95.6}\% \pm 0.6$ |
| Gemini 3.1 Flash | 7.1 | 4.78 | 7.1 | 4.98 | 14.0 | 8.17 | 0.432 | $77.6\% \pm 1.7$ |
| Gemini 3.1 Pro | 6.4 | 4.38 | 6.8 | 3.60 | 10.5 | 1.67 | 0.470 | $\mathbf{95.6}\% \pm 0.4$ |
| Socrates-Llama3-8B-SFT | 13.3 | 10.38 | 12.4 | 9.47 | 12.6 | 4.81 | 0.026 | $0.0\% \pm 0.0$ |
| Socrates-Llama3-8B-DPO | 8.7 | 6.14 | 8.6 | 6.73 | 9.6 | 5.97 | 0.426 | $39.7\% \pm 4.6$ |
| Socrates-Qwen-14B-SFT | 9.0 | 3.77 | 9.8 | 4.90 | 11.0 | 3.69 | 0.024 | $0.0\% \pm 0.0$ |
| Socrates-Qwen-14B-DPO | 8.1 | 6.46 | 9.0 | 6.85 | 10.5 | 7.83 | 0.448 | $70.3\% \pm 0.6$ |
| Centaur-70B | 9.8 | 3.23 | 9.3 | 3.83 | 10.6 | 5.18 | 0.430 | $37.7\% \pm 2.8$ |
| Be.FM-1-8B | 7.8 | 1.95 | 8.2 | 2.53 | 9.6 | 5.58 | 0.435 | $41.8\% \pm 3.0$ |
| Be.FM-1-70B | 7.8 | 1.92 | 8.6 | 2.72 | 9.5 | 5.48 | 0.446 | $60.5\% \pm 1.7$ |
| Be.FM-1.5-4B (Ours) | 7.4 | 1.77 | 7.0 | 1.56 | 9.5 | 4.79 | 0.468 | $53.3\% \pm 2.5$ |
| Be.FM-1.5-70B (Ours) | 7.2 | 2.30 | 6.8 | 1.68 | 9.4 | 6.60 | 0.477 | $73.0\% \pm 2.4$ |
Table 4: Benchmarking results of subject-trait inference and knowledge application. Tasks and abbreviations are described in Table [1](https://arxiv.org/html/2606.24162#S2.T1). Bold and underline denote the best and second best performance.
### 5.1 Model Capabilities
Tables [3](https://arxiv.org/html/2606.24162#S5.T3) and [4](https://arxiv.org/html/2606.24162#S5.T4) report model performance across BehaviorBench tasks, grouped by capabilities.
Behavior prediction and simulation. Be.FM-1.5 models perform strongly across both economic-game and survey-based tasks. In economic games, Be.FM-1.5-4B achieves the lowest Wasserstein distance in single-round behavior simulation ($W=7.0$), while Be.FM-1.5-70B performs best on multi-round prediction at both the individual and distributional levels (MAE $=18.0$, $W=3.5$). These gains are large relative to their backbone models (Qwen3-4B: $W=26.9$ in single-round behavior simulation; Llama-3.3-70B: MAE $=27.0$ and $W=15.3$ in multi-round prediction). In survey response prediction, frontier proprietary models retain the strongest individual-level accuracy, with Gemini 3.1 Pro excelling in demographic-to-response, sequential response, and masked response prediction. However, Be.FM-1.5 achieves the strongest distributional alignment on these survey tasks, as measured by Wasserstein distance. Other behavioral foundation models also show distributional strengths: for example, Centaur-70B obtains $W=0.43$ on demographic-to-response prediction, close to Be.FM-1.5, although its individual-level accuracies are lower.
Strategic decision-making. Be.FM-1.5 shows the clearest advantage in strategic decision-making. In the Beauty Contest game, Be.FM-1.5-4B and Be.FM-1.5-70B achieve win rates of $48\%$ and $50\%$, respectively, far above the next-best models, Claude Sonnet 4.6 and GPT 4.1, which each reach $17\%$. This suggests that Be.FM-1.5 is not only good at passive prediction, but also adaptive to decision-making in interactive play with humans. By contrast, earlier behavioral models such as Centaur-70B and Socrates variants obtain substantially lower win rates, ranging from $2\%$ to $8\%$.
Subject-trait inference. Model performance is more mixed for subject-trait inference. Frontier and large general-purpose models remain competitive on individual-level prediction: Gemini 3.1 Pro achieves the best MAE for demographic-to-personality prediction (MAE $=6.4$), Claude Opus 4.6 performs best on across-dimension personality prediction (MAE $=6.7$), and Llama-3.3-70B achieves the lowest MAE for age prediction from personality scores (MAE $=9.3$). At the same time, behavioral foundation models are much stronger at matching population-level trait distributions. Be.FM-1.5-4B achieves the best distributional alignment for demographic-to-personality prediction ($W=1.77$) and across-dimension personality prediction ($W=1.56$), substantially outperforming Gemini 3.1 Pro ($W=4.38$ and $3.60$). Centaur-70B and Socrates-Qwen-14B-SFT also perform competitively on distributional metrics, with $W=3.23$ and $3.77$ on demographic-to-personality prediction, respectively, but they lag behind Be.FM-1.5 and often show weaker individual-level prediction. These results suggest that behavior-specialized training helps recover population-level structure, even when individual trait inference remains challenging.
Behavioral knowledge application. Frontier proprietary models retain an advantage on contest-style economic reasoning. Claude Opus 4.6, GPT 5.4, and Gemini 3.1 Pro each achieve $95.6\%$ accuracy on the economics contest task, substantially higher than Be.FM-1.5-70B at $73.0\%$ and Be.FM-1.5-4B at $53.3\%$. This indicates that broad reasoning ability and general world knowledge remain important for abstract economic problem solving. However, Be.FM-1.5-70B performs strongly on scientific workflow prediction, matching the best BLEURT score among all models ($0.477$), while Be.FM-1.5-4B also remains competitive ($0.468$). In contrast, Centaur-70B and most Socrates variants perform notably worse on knowledge-intensive tasks, especially the economics contest, suggesting that models fine-tuned for behavioral prediction do not automatically acquire broader behavioral-science reasoning ability.
Overall, the results show that current foundation models exhibit uneven strengths across capabilities, and no single model family dominates all tasks. Instead, different model families excel in different parts of the benchmark: frontier proprietary LLMs are strongest on knowledge-intensive reasoning and several individual-level prediction tasks (e.g., survey response prediction and personality inference), while behavioral foundation models are generally better at behavior simulation in economic games and distributional-level behavioral alignment.
### 5.2 BehaviorBench Leaderboards
Because per-task metrics are heterogeneous, we derive a unified score to compare models across tasks. We reduce each task to a pairwise comparison among the models, inspired by the Holistic Evaluation of Language Models (HELM) leaderboards (Liang et al., [2022](https://arxiv.org/html/2606.24162#bib.bib50)). Concretely, we define a model’s win rate as the fraction of other models it outperforms on a given task (based on the task-specific metric), and its mean win rate as the average of its win rates across tasks. Because win rate depends only on relative rankings, it can be aggregated across tasks with different metrics. We report the individual-level and distributional-level mean win rates separately. We release a set of public leaderboards using this ranking methodology ([https://umich-foreseer.github.io/behaviorbench/](https://umich-foreseer.github.io/behaviorbench/)), including the mean win rate rankings at both the individual and distributional levels (top models listed in Figure [1](https://arxiv.org/html/2606.24162#S1.F1)), together with per-task win rate rankings.
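The win-rate aggregation can be sketched in a few lines. This is an illustrative reading of the definition, not the released leaderboard code: the function names are hypothetical, ties are assumed to count as "not outperformed", and the example uses only three models and a handful of columns from Table 3, so the resulting numbers are not the published rankings.

```python
from statistics import mean

def win_rates(scores, higher_is_better):
    """Fraction of the other models that each model strictly outperforms on one task."""
    rates = {}
    for model, score in scores.items():
        others = [v for m, v in scores.items() if m != model]
        if higher_is_better:
            wins = sum(score > v for v in others)
        else:
            wins = sum(score < v for v in others)
        rates[model] = wins / len(others)
    return rates

def mean_win_rates(tasks):
    """Average each model's per-task win rate over a list of (scores, higher_is_better)."""
    per_task = [win_rates(scores, hib) for scores, hib in tasks]
    return {m: mean(r[m] for r in per_task) for m in per_task[0]}

# Distributional level: Wasserstein distance (lower is better), values from Table 3.
distributional = [
    ({"Qwen3-4B": 26.9, "Gemini 3.1 Pro": 21.3, "Be.FM-1.5-4B": 7.0}, False),   # Game Behav. Sim.
    ({"Qwen3-4B": 13.1, "Gemini 3.1 Pro": 12.5, "Be.FM-1.5-4B": 5.7}, False),   # Multi-Round Pred.
]
# Individual level: MAE (lower is better) and accuracy (higher is better).
individual = [
    ({"Qwen3-4B": 21.7, "Gemini 3.1 Pro": 19.4, "Be.FM-1.5-4B": 18.4}, False),  # Multi-Round MAE
    ({"Qwen3-4B": 24.3, "Gemini 3.1 Pro": 30.5, "Be.FM-1.5-4B": 28.4}, True),   # Demo. To Resp. Acc
]

dist_mwr = mean_win_rates(distributional)
indiv_mwr = mean_win_rates(individual)
print("distributional:", dist_mwr)
print("individual:", indiv_mwr)
```

Because only the ordering of models within a task matters, MAE, accuracy, and Wasserstein distance can be mixed in one average without any rescaling.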
### 5.3 Individual-level vs. Distributional-level Performance
Figure [1](https://arxiv.org/html/2606.24162#S1.F1) presents the aggregated evaluations and model rankings at both the individual and distributional levels. The leaderboards show that, among all evaluated models, only Be.FM-1.5 and Gemini 3.1 Pro achieve leading performance on both types of metrics. Many frontier proprietary models, such as Claude Opus 4.6 and GPT 4.1, exhibit strong individual-level prediction but weaker distributional alignment, suggesting that they remain limited in their ability to simulate population-level behavioral patterns. In contrast, several behavioral foundation models, including Be.FM-1 models, Centaur-70B, and Socrates-Qwen-14B-SFT, achieve stronger distributional alignment than individual-level accuracy, indicating that fine-tuning on behavior-related data can help models better capture population heterogeneity. We also observe that open-source LLMs such as Llama-3.3-70B and DeepSeek V3.2 perform relatively poorly on both metrics.
Notably, Be.FM-1.5 is fine-tuned on data with no overlap with BehaviorBench, yet it leads on distributional metrics while remaining highly competitive on individual-level metrics. In particular, Be.FM-1.5-4B and Be.FM-1.5-70B rank first and second at the distributional level, respectively. Meanwhile, Be.FM-1.5-70B ranks third at the individual level, trailing only Gemini 3.1 Pro and Claude Opus 4.6, both of which are substantially larger models. Be.FM-1.5-4B ranks seventh at the individual level, still surpassing several proprietary models, including GPT 5.4 mini, Gemini 3.1 Flash, and Claude Haiku 4.5. The strong performance of Be.FM-1.5 suggests that individual-level accuracy and faithful distributional alignment can be achieved simultaneously, while also revealing substantial room to improve foundation models for behavioral science.
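Why the two evaluation levels can disagree is easiest to see on a toy example. The sketch below uses invented data and a plain one-dimensional Wasserstein-1 distance between equal-size samples; the benchmark's exact metric definitions and any normalization are specified elsewhere in the paper and are not reproduced here.

```python
def mae(pred, truth):
    """Individual-level error: mean absolute gap between paired predictions and outcomes."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size empirical samples.

    For equal-size 1-D samples this is the mean gap between sorted values.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Invented data: five subjects' contributions on a 0-100 scale.
human = [0, 20, 50, 80, 100]

# Predictor A always outputs the population mean.
mean_pred = [50, 50, 50, 50, 50]
# Predictor B reproduces the population exactly but assigns values to the wrong subjects.
shuffled_pred = [100, 80, 50, 20, 0]

for name, pred in [("mean", mean_pred), ("shuffled", shuffled_pred)]:
    print(name, "MAE =", mae(pred, human), "W =", wasserstein_1d(pred, human))
```

In this toy case the mean predictor has the lower MAE (32 vs. 64) but the worse distributional fit (W of 32 vs. 0), which mirrors the pattern of models that score well per subject while collapsing population heterogeneity.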
### 5.4 Generalizability of Be.FM-1.5
A central goal of developing behavioral foundation models is to generalize across subjects, contexts, and tasks. In this section, we examine Be.FM-1.5’s generalizability beyond its training tasks and data. We use Be.FM-1.5 for illustration as it is trained on a subset of tasks in BehaviorBench.
Figure 2: Multi-round behavior prediction accuracy on the Push/Pull game, which is an unseen context during Be.FM-1.5’s training.
##### Generalizing to unseen subjects.
BehaviorBench contains subjects held out from the training of Be.FM-1.5, so we can examine how fine-tuning enables generalization to these unseen subjects. Both Be.FM-1.5 variants improve over their respective backbone models across all four behavioral capabilities evaluated by the benchmark, and the gains are most pronounced at the distributional level. Aggregating the per-task improvements within each capability, Be.FM-1.5-4B improves its alignment to the human distribution by 59.6% on behavior prediction & simulation and by 63.2% on subject-trait inference, while Be.FM-1.5-70B improves by 53.7% and 45.7%. Individual-level prediction also improves consistently across all four capabilities, although by smaller margins. For example, Be.FM-1.5-4B gains +12.3% on behavior prediction & simulation, +7.2% on subject-trait inference, and +4.0% on scientific workflow prediction, and lifts the Beauty Contest win rate against human players from 9% to 48%. Be.FM-1.5-70B gains +7.6%, +4.2%, and +10.9% on the same three categories and lifts the win rate from 6% to 50%. Overall, this suggests that fine-tuning generalizes robustly to unseen subjects.
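The aggregation formula behind these percentages is not spelled out in this section. As a minimal sketch, assuming each gain is a simple relative change with respect to the backbone score, the scientific workflow figures are consistent with the BLEURT values in Table 4; the function name and the `higher_is_better` flag are illustrative.

```python
def relative_improvement(backbone, finetuned, higher_is_better):
    """Relative change w.r.t. the backbone score, signed so that positive means better."""
    change = finetuned - backbone if higher_is_better else backbone - finetuned
    return change / backbone

# BLEURT on scientific workflow prediction (Table 4, higher is better).
gain_4b = relative_improvement(0.450, 0.468, higher_is_better=True)    # Qwen3-4B -> Be.FM-1.5-4B
gain_70b = relative_improvement(0.430, 0.477, higher_is_better=True)   # Llama3.3-70B -> Be.FM-1.5-70B

# A single lower-is-better task for comparison: W on single-round game simulation (Table 3).
# This is one task only, not the capability-level aggregate quoted in the text.
gain_w = relative_improvement(26.9, 7.0, higher_is_better=False)       # Qwen3-4B -> Be.FM-1.5-4B

print(f"BLEURT, 4B:  {gain_4b:+.1%}")
print(f"BLEURT, 70B: {gain_70b:+.1%}")
print(f"W (single task), 4B: {gain_w:+.1%}")
```

The Beauty Contest numbers, by contrast, are quoted as absolute win rates (9% to 48%, 6% to 50%) rather than relative changes.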
##### Generalizing to unseen contexts.
The Push/Pull game is a multi-round Prisoner’s-Dilemma-like game, where in each round a player can choose to pull (defect) or push (cooperate). In this game, we evaluate whether a model can predict a human player’s next choice based on their choices in prior rounds. This game is not used in the training of Be.FM-1.5 models, thus representing an unseen context. In multi-round game behavior prediction, both Be.FM-1.5-70B and Be.FM-1.5-4B show improved prediction accuracy as the round number increases, indicating that they leverage accumulated player behavior more effectively with more historical records (Figure [2](https://arxiv.org/html/2606.24162#S5.F2)). Notably, Be.FM-1.5-70B exceeds Llama-3.3-70B across all rounds, showing that behavioral fine-tuning transfers to a structurally novel context. (Because the number of available sessions in later rounds is small in the downsampled MobLab data, we evaluate the Be.FM-1.5 models and their backbone models on an extended evaluation set rather than BehaviorBench for this analysis. This extended set contains 52,786 prediction examples from 13,991 unique play sessions.)
##### Generalizing to unseen tasks.
The IEO contest is a task that Be.FM-1.5 has never seen during training. On this task, Be.FM-1.5-70B reaches 73.0%, about 7 percentage points above its backbone model (65.8%), indicating that training on behavior data helps the model generalize to knowledge-application tasks. We also note that Be.FM-1.5-4B shows the opposite pattern, dropping from 73.9% (Qwen3-4B) to 53.3%. One hypothesis for this regression is overfitting due to the smaller model size, which might lead to a loss of the reasoning capability required to answer certain questions. Consistent with this hypothesis, we observe that Be.FM-1.5-4B has a much shorter output sequence length than Qwen3-4B, which indicates that Be.FM-1.5-4B loses the ability to generate step-by-step reasoning after fine-tuning. We present an in-depth analysis and a case study in Appendix [D](https://arxiv.org/html/2606.24162#A4).
### 5.5 Contextual Reasoning
One limitation of BehaviorBench is that it does not include reasoning about contexts ($c$ in Eq. [1](https://arxiv.org/html/2606.24162#S1.E1)) as an evaluation target. This capability would include tasks such as inferring contextual factors or treatments from subject traits and outcome behaviors. We exclude such tasks because the inferred contexts are difficult to evaluate quantitatively at scale. Unlike behavior $y$, which can be observed, or subject traits $x$, which can be self-reported or measured, there could be multiple contextual factors that intertwine and lead to the outcomes. Even when treatments are documented, assessing inference accuracy remains challenging because treatments are often complex and heterogeneous across experiments. Appendix [C](https://arxiv.org/html/2606.24162#A3) provides a qualitative evaluation of foundation models on contextual reasoning. Developing scalable quantitative evaluations for this capability is an important direction for future work.
### 5.6 Prompting and Behavioral Simulation
An important consideration for behavioral simulation is the role of prompting. In the current version of BehaviorBench, prompts are standardized within each task to ensure a controlled comparison across models. For example, models receive identical game instructions in economic games, while survey prediction tasks use prompts constructed only from available demographic information or observed responses. This design isolates differences in model capability rather than prompt engineering, but the resulting scores should be read as a lower bound on what a given model can achieve: they reflect performance under a single fixed prompt (or a fixed prompt pool), not the best performance a researcher could obtain by adapting the prompt to the task and the target population.
In practice, several prompting strategies can substantially improve behavioral simulation beyond the fixed-prompt numbers reported here, and we highlight them so that practitioners can close the gap between benchmark scores and deployable performance. At the individual level, grounding the prompt in richer subject context (e.g., persona or demographic conditioning, role instructions, and in-context examples drawn from observed human responses) typically sharpens per-subject prediction (Argyle et al., [2023](https://arxiv.org/html/2606.24162#bib.bib9); Aher et al., [2023](https://arxiv.org/html/2606.24162#bib.bib51); Park et al., [2024](https://arxiv.org/html/2606.24162#bib.bib14)). At the distributional level, the key is to vary the prompt across the simulated population rather than to query a single prompt repeatedly. Xie et al. ([2025c](https://arxiv.org/html/2606.24162#bib.bib52)) show that systematically varying system prompts (“behavioral codes”) steers an LLM across the full spectrum of human behaviors in classic economic games, and that the induced behavior distribution can be aligned with that of a target human population. Relatedly, modeling the prompt itself as a distribution recovers population-level heterogeneity that a fixed prompt collapses onto a single mode (Xie et al., [2025a](https://arxiv.org/html/2606.24162#bib.bib25)). Because these gains come from calibrating prompts against human data, reported results should distinguish zero-shot, fixed-prompt performance from performance obtained after such calibration, so that prompt tuning is not mistaken for an intrinsic property of the model.
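The contrast between querying one fixed prompt and sampling prompts from a distribution can be illustrated with a toy simulation. Everything here is hypothetical: the three system prompts, the mixture weights, and especially `stub_model`, a deterministic stand-in for an LLM call that simply maps each prompt to one behavior. It is a sketch of the idea, not the procedure used in the cited work.

```python
import random
from collections import Counter

# Hypothetical "behavioral codes": each system prompt nudges the simulated
# player toward a different contribution in a 0-100 allocation game.
CODES = {
    "You are strongly self-interested.": 0,
    "You split things evenly.": 50,
    "You are very generous.": 100,
}

def stub_model(system_prompt):
    """Stand-in for an LLM call; returns the behavior this prompt induces."""
    return CODES[system_prompt]

def simulate_fixed(prompt, n):
    """Query the same prompt n times (one prompt for the whole population)."""
    return [stub_model(prompt) for _ in range(n)]

def simulate_mixture(weights, n, seed=0):
    """Draw one prompt per simulated subject from a weighted prompt distribution."""
    rng = random.Random(seed)
    prompts = rng.choices(list(weights), weights=list(weights.values()), k=n)
    return [stub_model(p) for p in prompts]

# Hypothetical mixture weights, imagined as calibrated against human data.
weights = {
    "You are strongly self-interested.": 0.3,
    "You split things evenly.": 0.5,
    "You are very generous.": 0.2,
}

fixed = simulate_fixed("You split things evenly.", n=1000)
mixed = simulate_mixture(weights, n=1000)

print("fixed prompt:  ", Counter(fixed))
print("prompt mixture:", Counter(mixed))
```

With the stub, the fixed prompt yields a single behavior for every simulated subject, while the mixture spreads subjects across all three behaviors in roughly the chosen proportions. A real model is stochastic, so a fixed prompt would produce some spread, but the point that the mixture weights become the calibration knob carries over.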
## 6 Related Work
### 6.1 Foundation Models for Behavioral Science Tasks
General foundation models, including GPT and Llama, are increasingly being applied to behavioral science domains such as psychology, sociology, and economics (Demszky et al., [2023](https://arxiv.org/html/2606.24162#bib.bib6); Meng, [2024](https://arxiv.org/html/2606.24162#bib.bib7); Bail, [2024](https://arxiv.org/html/2606.24162#bib.bib8)). These models have shown promise in predicting survey responses (Argyle et al., [2023](https://arxiv.org/html/2606.24162#bib.bib9); Grossmann et al., [2023](https://arxiv.org/html/2606.24162#bib.bib10)), enabling automated content analysis (Wu et al., [2023](https://arxiv.org/html/2606.24162#bib.bib11); Yang and Menczer, [2023](https://arxiv.org/html/2606.24162#bib.bib12); Gilardi et al., [2023](https://arxiv.org/html/2606.24162#bib.bib13)), supporting agent-based modeling (Park et al., [2023](https://arxiv.org/html/2606.24162#bib.bib2), [2024](https://arxiv.org/html/2606.24162#bib.bib14)), and inferring experimental results (Hewitt et al., [2024](https://arxiv.org/html/2606.24162#bib.bib15); Lippert et al., [2024](https://arxiv.org/html/2606.24162#bib.bib16)). However, recent work has also revealed important limitations in directly applying general-purpose LLMs to behavioral science tasks. For example, these models often fail to capture population variance (Bisbee et al., [2024](https://arxiv.org/html/2606.24162#bib.bib53)), exhibit sensitivity or bias toward specific data formats (Dentella et al., [2023](https://arxiv.org/html/2606.24162#bib.bib54)), and underrepresent certain populations (Santurkar et al., [2023](https://arxiv.org/html/2606.24162#bib.bib28)). This mismatch between the needs of behavioral science and the design of general foundation models calls for domain-specific models that better align with behavioral science principles, data, and evaluation goals.
A small but growing body of work has begun to address this need. Monad (Rychalska et al., [2023](https://arxiv.org/html/2606.24162#bib.bib35)), for example, learns dense user representations from streams of clicks, transactions, and other behavioral events. Centaur (Binz et al., [2024](https://arxiv.org/html/2606.24162#bib.bib18)) models human responses in cognitive-psychology experiments, including tasks involving decision-making, memory, supervised learning, and Markov decision processes. Be.FM models (Xie et al., [2025b](https://arxiv.org/html/2606.24162#bib.bib17)), trained on experimental, survey, and literature data, demonstrate capabilities in predicting human behavior, inferring subject characteristics, generating insights about contextual factors, and applying behavioral science knowledge to problem-solving. Socrates (Kolluri et al., [2025](https://arxiv.org/html/2606.24162#bib.bib19)) adapts language models to social-science survey data. However, despite this growing body of work, we still lack a systematic understanding of how well foundation models perform across diverse behavioral science tasks, contexts, and populations.
### 6.2 Benchmarks for Behavioral Science Tasks
A growing line of work has developed benchmarks for evaluating how well language models can reproduce human responses in surveys and social science experiments. SubPOP (Suh et al., [2025](https://arxiv.org/html/2606.24162#bib.bib20)) targets the prediction of population-response distributions for (subpopulation, survey question) pairs and shows that fine-tuning on survey data substantially improves alignment with held-out human responses. SocioBench (Wang et al., [2025](https://arxiv.org/html/2606.24162#bib.bib21)) predicts a survey respondent’s closed-ended answer from their demographic profile, spanning ten sociological domains and over thirty countries. Several other datasets could also potentially be developed into benchmarks of this type, including OpinionQA (Santurkar et al., [2023](https://arxiv.org/html/2606.24162#bib.bib28)), Psych-101 (Binz et al., [2024](https://arxiv.org/html/2606.24162#bib.bib18)), and SocSci210 (Kolluri et al., [2025](https://arxiv.org/html/2606.24162#bib.bib19)). These benchmarks have established survey response prediction as a central task for evaluating LLMs as proxies for human respondents. However, most prior work focuses primarily on this single task formulation without exploring alternative formulations, such as predicting missing survey responses.
Beyond predicting responses in surveys and social science experiments, recent work also evaluates LLMs in economic and strategic settings. GTBench (Duan et al., [2024](https://arxiv.org/html/2606.24162#bib.bib23)) evaluates the strategic reasoning limitations of LLMs across a suite of classic game-theoretic scenarios, and GAMA-Bench (Huang et al., [2025](https://arxiv.org/html/2606.24162#bib.bib24)) embeds competing language models within multi-agent gaming environments. CHBench (Liu et al., [2025](https://arxiv.org/html/2606.24162#bib.bib55)) grounds evaluation in cognitive hierarchy theory, while Economics Arena (Guo et al., [2024](https://arxiv.org/html/2606.24162#bib.bib31)) benchmarks LLMs across a range of economic decision tasks. GLEE (Shapira et al., [2024](https://arxiv.org/html/2606.24162#bib.bib22)) provides a unified framework and benchmark for language-based economic environments such as bargaining, negotiation, and persuasion games. Most of these benchmarks have LLMs play against other LLMs or against heuristics, with no human opponents. In our work, we use human play records to construct a proxy evaluation for LLMs playing against human players.
## 7 Conclusion
We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models across four core capabilities with metrics at both the individual and distributional levels. Our evaluation reveals that current foundation models exhibit uneven strengths across behavioral-science tasks, and no single model family dominates the full benchmark. In particular, general-purpose frontier proprietary LLMs excel at individual-level prediction and knowledge-intensive tasks. In contrast, behavioral foundation models fine-tuned on behavior-related data tend to perform more strongly on distributional alignment. Among all evaluated models, only Be.FM-1.5 and Gemini 3.1 Pro achieve leading performance on both distributional and individual-level metrics. Notably, Be.FM-1.5 is fine-tuned on data with no overlap with BehaviorBench, yet leads on distributional metrics while remaining highly competitive on individual-level metrics. This suggests that individual-level accuracy and faithful distributional alignment can be achieved simultaneously, while also revealing substantial room for improving foundation models for behavioral science. More broadly, by emphasizing distributional alignment, BehaviorBench can help assess whether AI systems reflect the heterogeneity of human populations rather than overfitting to majority or average behaviors. Overall, our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be.FM-1.5’s potential for a broad range of behavioral science studies.
## References
- G\. V\. Aher, R\. I\. Arriaga, and A\. T\. Kalai \(2023\)Using large language models to simulate multiple humans and replicate human subject studies\.InProceedings of the 40th International Conference on Machine Learning,Proceedings of Machine Learning Research,pp\. 337–371\.Cited by:[§5\.6](https://arxiv.org/html/2606.24162#S5.SS6.p2.1)\.
- J\. Andreoni and B\. D\. Bernheim \(2009\)Social image and the 50–50 norm: a theoretical and experimental analysis of audience effects\.Econometrica77\(5\),pp\. 1607–1636\.Cited by:[1st item](https://arxiv.org/html/2606.24162#A3.I1.i1.p1.1)\.
- Anthropic \(2025\)Introducing claude haiku 4\.5\.Note:Anthropic BlogOctober 15, 2025External Links:[Link](https://www.anthropic.com/news/claude-haiku-4-5)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- Anthropic \(2026a\)Introducing claude opus 4\.6\.Note:Anthropic BlogFebruary 5, 2026External Links:[Link](https://www.anthropic.com/news/claude-opus-4-6)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- Anthropic \(2026b\)Introducing claude sonnet 4\.6\.Note:Anthropic BlogFebruary 17, 2026External Links:[Link](https://www.anthropic.com/news/claude-sonnet-4-6)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- L\. P\. Argyle, E\. C\. Busby, N\. Fulda, J\. R\. Gubler, C\. Rytting, and D\. Wingate \(2023\)Out of one, many: using language models to simulate human samples\.Political Analysis31\(3\),pp\. 337–351\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§5\.6](https://arxiv.org/html/2606.24162#S5.SS6.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- C\. A\. Bail \(2024\)Can generative ai improve social science?\.Proceedings of the National Academy of Sciences121\(21\),pp\. e2314021121\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- M\. Binz, E\. Akata, M\. Bethge, F\. Brändle, F\. Callaway, J\. Coda\-Forno, P\. Dayan, C\. Demircan, M\. K\. Eckstein, N\. Éltető,et al\.\(2024\)Centaur: a foundation model of human cognition\.arXiv preprint arXiv:2410\.20268\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p1.1),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[§4](https://arxiv.org/html/2606.24162#S4.p4.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p2.1),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p1.1)\.
- J\. Bisbee, J\. D\. Clinton, C\. Dorff, B\. Kenkel, and J\. M\. Larson \(2024\)Synthetic replacements for human survey data? the perils of large language models\.Political Analysis32\(4\),pp\. 401–416\.Cited by:[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- W\. Bleidorn, P\. L\. Hill, M\. D\. Back, J\. J\. Denissen, M\. Hennecke, C\. J\. Hopwood, M\. Jokela, C\. Kandler, R\. E\. Lucas, M\. Luhmann,et al\.\(2019\)The policy relevance of personality traits\.American psychologist74\(9\),pp\. 1056\.Cited by:[§2\.2\.3](https://arxiv.org/html/2606.24162#S2.SS2.SSS3.p1.2)\.
- G\. E\. Bolton, E\. Katok, and R\. Zwick \(1998\)Dictator game giving: rules of fairness versus acts of kindness\.International journal of game theory27\(2\),pp\. 269–299\.Cited by:[1st item](https://arxiv.org/html/2606.24162#A3.I1.i1.p1.1)\.
- R\. Bommasani, D\. A\. Hudson, E\. Adeli, R\. B\. Altman, S\. Arora, S\. von Arx, M\. S\. Bernstein, J\. Bohg, A\. Bosselut, E\. Brunskill, E\. Brynjolfsson, S\. Buch, D\. Card, R\. Castellon, N\. S\. Chatterji, A\. S\. Chen, K\. Creel, J\. Q\. Davis, D\. Demszky, C\. Donahue, M\. Doumbouya, E\. Durmus, S\. Ermon, J\. Etchemendy, K\. Ethayarajh, L\. Fei\-Fei, C\. Finn, T\. Gale, L\. E\. Gillespie, K\. Goel, N\. D\. Goodman, S\. Grossman, N\. Guha, T\. Hashimoto, P\. Henderson, J\. Hewitt, D\. E\. Ho, J\. Hong, K\. Hsu, J\. Huang, T\. Icard, S\. Jain, D\. Jurafsky, P\. Kalluri, S\. Karamcheti, G\. Keeling, F\. Khani, O\. Khattab, P\. W\. Koh, M\. S\. Krass, R\. Krishna, R\. Kuditipudi,et al\.\(2021\)On the opportunities and risks of foundation models\.CoRRabs/2108\.07258\.External Links:[Link](https://arxiv.org/abs/2108.07258),2108\.07258Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p1.1)\.
- P\. Brañas\-Garza \(2007\)Promoting helping behavior with framing in dictator games\.Journal of Economic Psychology28\(4\),pp\. 477–486\.Cited by:[4th item](https://arxiv.org/html/2606.24162#A3.I1.i4.p1.1)\.
- J\. M\. Brick and G\. Kalton \(1996\)Handling missing data in survey research\.Statistical methods in medical research5\(3\),pp\. 215–238\.Cited by:[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p6.9)\.
- J\. Dana, R\. A\. Weber, and J\. X\. Kuang \(2007\)Exploiting moral wiggle room: experiments demonstrating an illusory preference for fairness\.Economic Theory33\(1\),pp\. 67–80\.Cited by:[2nd item](https://arxiv.org/html/2606.24162#A3.I1.i2.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.CoRRabs/2512\.02556\.External Links:[Link](https://doi.org/10.48550/arXiv.2512.02556),[Document](https://dx.doi.org/10.48550/ARXIV.2512.02556),2512\.02556Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p2.1)\.
- D\. Demszky, D\. Yang, D\. S\. Yeager, C\. J\. Bryan, M\. Clapper, S\. Chandhok, J\. C\. Eichstaedt, C\. Hecht, J\. Jamieson, M\. Johnson,et al\.\(2023\)Using large language models in psychology\.Nature Reviews Psychology2\(11\),pp\. 688–701\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§2\.2\.3](https://arxiv.org/html/2606.24162#S2.SS2.SSS3.p1.2),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- V\. Dentella, F\. Günther, and E\. Leivada \(2023\)Systematic testing of three language models reveals low language accuracy, absence of response stability, and a yes\-response bias\.Proceedings of the National Academy of Sciences120\(51\),pp\. e2309583120\.Cited by:[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- J\. Duan, R\. Zhang, J\. Diffenderfer, B\. Kailkhura, L\. Sun, E\. Stengel\-Eskin, M\. Bansal, T\. Chen, and K\. Xu \(2024\)GTBench: uncovering the strategic reasoning limitations of llms via game\-theoretic evaluations\.CoRRabs/2402\.12348\.External Links:[Link](https://doi.org/10.48550/arXiv.2402.12348),[Document](https://dx.doi.org/10.48550/ARXIV.2402.12348),2402\.12348Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p1.5),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[Table 2](https://arxiv.org/html/2606.24162#S2.T2.17.17.17.3),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p2.1)\.
- C\. C\. Eckel and P\. J\. Grossman \(1998\)Are women less selfish than men?: evidence from dictator experiments\.The economic journal108\(448\),pp\. 726–735\.Cited by:[6th item](https://arxiv.org/html/2606.24162#A3.I1.i6.p1.1)\.
- C\. Engel \(2011\)Dictator games: a meta study\.Experimental economics14\(4\),pp\. 583–610\.Cited by:[1st item](https://arxiv.org/html/2606.24162#A3.I1.i1.p1.1),[3rd item](https://arxiv.org/html/2606.24162#A3.I1.i3.p1.1),[4th item](https://arxiv.org/html/2606.24162#A3.I1.i4.p1.1),[6th item](https://arxiv.org/html/2606.24162#A3.I1.i6.p1.1),[Table 6](https://arxiv.org/html/2606.24162#A3.T6),[Table 6](https://arxiv.org/html/2606.24162#A3.T6.14.2),[Appendix C](https://arxiv.org/html/2606.24162#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.24162#A3.p4.1)\.
- F\. Gilardi, M\. Alizadeh, and M\. Kubli \(2023\)ChatGPT outperforms crowd workers for text\-annotation tasks\.Proceedings of the National Academy of Sciences120\(30\),pp\. e2305016120\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- H\. Gintis \(2014\)The bounds of reason: game theory and the unification of the behavioral sciences, revised edition\.Princeton University Press\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p1.1)\.
- J\. K\. Goeree, M\. A\. McConnell, T\. Mitchell, T\. Tromp, and L\. Yariv \(2010\)The 1/d law of giving\.American Economic Journal: Microeconomics2\(1\),pp\. 183–203\.Cited by:[5th item](https://arxiv.org/html/2606.24162#A3.I1.i5.p1.1)\.
- Google \(2026\)Gemini 3\.1 pro: a smarter model for your most complex tasks\.Note:Google BlogMarch 19, 2026External Links:[Link](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- I\. Grossmann, M\. Feinberg, D\. C\. Parker, N\. A\. Christakis, P\. E\. Tetlock, and W\. A\. Cunningham \(2023\)AI and the transformation of social science research\.Science380\(6650\),pp\. 1108–1109\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- S\. Guo, H\. Bu, H\. Wang, Y\. Ren, D\. Sui, Y\. Shang, and S\. Lu \(2024\)Economics arena for large language models\.CoRRabs/2401\.01735\.External Links:[Link](https://doi.org/10.48550/arXiv.2401.01735),[Document](https://dx.doi.org/10.48550/ARXIV.2401.01735),2401\.01735Cited by:[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p1.5),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p2.1)\.
- J\. Henrich, R\. Boyd, S\. Bowles, C\. Camerer, E\. Fehr, H\. Gintis, R\. McElreath, M\. Alvard, A\. Barr, J\. Ensminger,et al\.\(2005\)“Economic man” in cross\-cultural perspective: behavioral experiments in 15 small\-scale societies\.Behavioral and brain sciences28\(6\),pp\. 795–815\.Cited by:[6th item](https://arxiv.org/html/2606.24162#A3.I1.i6.p1.1)\.
- L\. Hewitt, A\. Ashokkumar, I\. Ghezae, and R\. Willer \(2024\)Predicting results of social science experiments using large language models\.Preprint\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- T\. Ho, C\. Camerer, and K\. Weigelt \(1998\)Iterated dominance and iterated best response in experimental “p\-beauty contests”\.The American Economic Review88\(4\),pp\. 947–969\.Cited by:[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p2.5)\.
- E\. Hoffman, K\. McCabe, K\. Shachat, and V\. Smith \(1994\)Preferences, property rights, and anonymity in bargaining games\.Games and Economic behavior7\(3\),pp\. 346–380\.Cited by:[4th item](https://arxiv.org/html/2606.24162#A3.I1.i4.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§3\.1](https://arxiv.org/html/2606.24162#S3.SS1.p1.3)\.
- J\. Huang, E\. J\. Li, M\. H\. Lam, T\. Liang, W\. Wang, Y\. Yuan, W\. Jiao, X\. Wang, Z\. Tu, and M\. R\. Lyu \(2025\)Competing large language models in multi\-agent gaming environments\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=DI4gW8viB6)Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p1.5),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[Table 2](https://arxiv.org/html/2606.24162#S2.T2.19.19.19.3),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p2.1)\.
- M\. O\. Jackson, Q\. Mei, S\. W\. Wang, Y\. Xie, W\. Yuan, S\. Benzell, E\. Brynjolfsson, C\. F\. Camerer, J\. Evans, B\. Jabarian, J\. M\. Kleinberg, J\. Meng, S\. Mullainathan, A\. Ozdaglar, T\. Pfeiffer, M\. Tennenholtz, R\. Willer, D\. Yang, and T\. Ye \(2025\)AI behavioral science\.CoRRabs/2509\.13323\.External Links:[Link](https://doi.org/10.48550/arXiv.2509.13323),[Document](https://dx.doi.org/10.48550/ARXIV.2509.13323),2509\.13323Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p1.1),[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p1.6)\.
- A\. Kolluri, S\. Wu, J\. S\. Park, and M\. S\. Bernstein \(2025\)Finetuning llms for human behavior prediction in social science experiments\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 30096–30111\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.1530),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.1530)Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[2nd item](https://arxiv.org/html/2606.24162#S2.I2.i2.p1.1),[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p1.1),[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p6.9),[§2\.3](https://arxiv.org/html/2606.24162#S2.SS3.p2.2),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[§4](https://arxiv.org/html/2606.24162#S4.p4.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p2.1),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p1.1)\.
- S\. Leider, M\. M\. Möbius, T\. Rosenblat, and Q\. Do \(2009\)Directed altruism and enforced reciprocity in social networks\.The Quarterly Journal of Economics124\(4\),pp\. 1815–1851\.Cited by:[5th item](https://arxiv.org/html/2606.24162#A3.I1.i5.p1.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Ré, D\. Acosta\-Navas, D\. A\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. Wang, K\. Santhanam, L\. J\. Orr, L\. Zheng, M\. Yüksekgönül, M\. Suzgun, N\. Kim, N\. Guha, N\. S\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, R\. Chi, S\. M\. Xie, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda \(2022\)Holistic evaluation of language models\.CoRRabs/2211\.09110\.External Links:[Link](https://doi.org/10.48550/arXiv.2211.09110),[Document](https://dx.doi.org/10.48550/ARXIV.2211.09110),2211\.09110Cited by:[§5\.2](https://arxiv.org/html/2606.24162#S5.SS2.p1.1)\.
- S\. Lippert, A\. Dreber, M\. Johannesson, W\. Tierney, W\. Cyrus\-Lai, E\. L\. Uhlmann, T\. Pfeiffer, E\. E\. Collaboration,et al\.\(2024\)Can large language models help predict results from a complex behavioural science study?\.Royal Society Open Science11\(9\)\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- R\. J\. Little and D\. B\. Rubin \(2019\)Statistical analysis with missing data\.John Wiley & Sons\.Cited by:[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p6.9)\.
- H\. Liu, Z\. Du, Z\. Wang, and W\. Shen \(2025\)CHBench: A cognitive hierarchy benchmark for evaluating strategic reasoning capability of llms\.CoRRabs/2508\.11944\.External Links:[Link](https://doi.org/10.48550/arXiv.2508.11944),[Document](https://dx.doi.org/10.48550/ARXIV.2508.11944),2508\.11944Cited by:[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p2.1)\.
- Q\. Mei, Y\. Xie, W\. Yuan, and M\. O\. Jackson \(2024\)A turing test of whether ai chatbots are behaviorally similar to humans\.Proceedings of the National Academy of Sciences121\(9\),pp\. e2313925121\.Cited by:[§A\.1](https://arxiv.org/html/2606.24162#A1.SS1.SSS0.Px1.p1.1),[Appendix H](https://arxiv.org/html/2606.24162#A8.SS0.SSS0.Px1.p1.1),[item 1](https://arxiv.org/html/2606.24162#S2.I1.i1.p1.1),[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p1.6)\.
- J\. Meng \(2024\)AI emerges as the frontier in behavioral science\.Proceedings of the National Academy of Sciences121\(10\),pp\. e2401336121\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- R\. Nagel \(1995\)Unraveling in guessing games: an experimental study\.The American economic review85\(5\),pp\. 1313–1326\.Cited by:[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p2.5)\.
- OpenAI \(2025a\)Introducing gpt\-4\.1 in the api\.Note:OpenAI BlogApril 14, 2025External Links:[Link](https://openai.com/index/gpt-4-1/)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- OpenAI \(2025b\)Introducing gpt\-5\.Note:OpenAI BlogAugust 7, 2025External Links:[Link](https://openai.com/index/introducing-gpt-5/)Cited by:[§A\.1](https://arxiv.org/html/2606.24162#A1.SS1.SSS0.Px3.p1.1)\.
- OpenAI \(2026a\)Introducing gpt\-5\.4 mini and nano\.Note:OpenAI BlogMarch 5, 2026External Links:[Link](https://openai.com/index/introducing-gpt-5-4-mini-and-nano/)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- OpenAI \(2026b\)Introducing gpt\-5\.4\.Note:OpenAI BlogMarch 5, 2026External Links:[Link](https://openai.com/index/introducing-gpt-5-4/)Cited by:[§4](https://arxiv.org/html/2606.24162#S4.p3.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Francisco, CA, USA, 29 October 2023\- 1 November 2023,S\. Follmer, J\. Han, J\. Steimle, and N\. H\. Riche \(Eds\.\),pp\. 2:1–2:22\.External Links:[Link](https://doi.org/10.1145/3586183.3606763),[Document](https://dx.doi.org/10.1145/3586183.3606763)Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p1.1),[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- J\. S\. Park, C\. Q\. Zou, A\. Shaw, B\. M\. Hill, C\. J\. Cai, M\. R\. Morris, R\. Willer, P\. Liang, and M\. S\. Bernstein \(2024\)Generative agent simulations of 1,000 people\.CoRRabs/2411\.10109\.External Links:[Link](https://doi.org/10.48550/arXiv.2411.10109),[Document](https://dx.doi.org/10.48550/ARXIV.2411.10109),2411\.10109Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§5\.6](https://arxiv.org/html/2606.24162#S5.SS6.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- M\. Rigdon, K\. Ishii, M\. Watabe, and S\. Kitayama \(2009\)Minimal social cues in the dictator game\.Journal of Economic Psychology30\(3\),pp\. 358–367\.Cited by:[2nd item](https://arxiv.org/html/2606.24162#A3.I1.i2.p1.1)\.
- B\. Rychalska, S\. Lukasik, and J\. Dabrowski \(2023\)Synerise monad: A foundation model for behavioral event data\.InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23\-27, 2023,H\. Chen, W\. \(\. Duh, H\. Huang, M\. P\. Kato, J\. Mothe, and B\. Poblete \(Eds\.\),pp\. 3344–3348\.External Links:[Link](https://doi.org/10.1145/3539618.3591851),[Document](https://dx.doi.org/10.1145/3539618.3591851)Cited by:[§2\.2\.3](https://arxiv.org/html/2606.24162#S2.SS2.SSS3.p1.2),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p2.1)\.
- S\. Santurkar, E\. Durmus, F\. Ladhak, C\. Lee, P\. Liang, and T\. Hashimoto \(2023\)Whose opinions do language models reflect?\.InInternational Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research,pp\. 29971–30004\.External Links:[Link](https://proceedings.mlr.press/v202/santurkar23a.html)Cited by:[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p6.9),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p1.1)\.
- T\. Sellam, D\. Das, and A\. P\. Parikh \(2020\)BLEURT: learning robust metrics for text generation\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5\-10, 2020,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. R\. Tetreault \(Eds\.\),pp\. 7881–7892\.External Links:[Link](https://doi.org/10.18653/v1/2020.acl-main.704),[Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.704)Cited by:[4th item](https://arxiv.org/html/2606.24162#S2.I2.i4.p1.1)\.
- E\. Shapira, O\. Madmon, I\. Reinman, S\. J\. Amouyal, R\. Reichart, and M\. Tennenholtz \(2024\)GLEE: A unified framework and benchmark for language\-based economic environments\.CoRRabs/2410\.05254\.External Links:[Link](https://doi.org/10.48550/arXiv.2410.05254),[Document](https://dx.doi.org/10.48550/ARXIV.2410.05254),2410\.05254Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[§2\.2\.2](https://arxiv.org/html/2606.24162#S2.SS2.SSS2.p1.5),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[Table 2](https://arxiv.org/html/2606.24162#S2.T2.15.15.15.3),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p2.1)\.
- J\. Suh, E\. Jahanparast, S\. Moon, M\. Kang, and S\. Chang \(2025\)Language model fine\-tuning on scaled survey data for predicting distributions of public opinions\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 21147–21170\.External Links:[Link](https://aclanthology.org/2025.acl-long.1028/)Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[2nd item](https://arxiv.org/html/2606.24162#S2.I2.i2.p1.1),[§2\.3](https://arxiv.org/html/2606.24162#S2.SS3.p2.2),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[Table 2](https://arxiv.org/html/2606.24162#S2.T2.13.13.13.3),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p1.1)\.
- L\. Team \(2024\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783),[Document](https://dx.doi.org/10.48550/ARXIV.2407.21783),2407\.21783Cited by:[§3\.1](https://arxiv.org/html/2606.24162#S3.SS1.p1.3),[§4](https://arxiv.org/html/2606.24162#S4.p2.1)\.
- Q\. Team \(2025\)Qwen3 technical report\.CoRRabs/2505\.09388\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.09388),[Document](https://dx.doi.org/10.48550/ARXIV.2505.09388),2505\.09388Cited by:[§3\.1](https://arxiv.org/html/2606.24162#S3.SS1.p1.3),[§4](https://arxiv.org/html/2606.24162#S4.p2.1)\.
- H\. Wang, T\. Fu, Y\. Du, W\. Gao, K\. Huang, Z\. Liu, P\. Chandak, S\. Liu, P\. Van Katwyk, A\. Deac,et al\.\(2023\)Scientific discovery in the age of artificial intelligence\.Nature620\(7972\),pp\. 47–60\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p1.1)\.
- J\. Wang, Z\. Zhao, T\. Ni, and Z\. Wei \(2025\)SocioBench: modeling human behavior in sociological surveys with large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 26257–26289\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.1335),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.1335)Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p1.1),[§2\.2\.1](https://arxiv.org/html/2606.24162#S2.SS2.SSS1.p6.9),[§2\.4](https://arxiv.org/html/2606.24162#S2.SS4.p1.3),[Table 2](https://arxiv.org/html/2606.24162#S2.T2.11.11.11.3),[§6\.2](https://arxiv.org/html/2606.24162#S6.SS2.p1.1)\.
- P\. Y\. Wu, J\. Nagler, J\. A\. Tucker, and S\. Messing \(2023\)Large language models can be used to scale the ideologies of politicians in a zero\-shot learning setting\.preprint\.Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- Y\. Xie, R\. Gao, and Q\. Mei \(2025a\)Distributional alignment for social simulation with llms: a prompt mixture modeling approach\.InFirst Workshop on Social Simulation with LLMs,Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p3.1),[§5\.6](https://arxiv.org/html/2606.24162#S5.SS6.p2.1)\.
- Y\. Xie, Z\. Li, X\. Wang, Y\. Pan, Q\. Liu, X\. Cui, K\. Lo, R\. Gao, X\. Zhang, J\. Huang, W\. Yuan, M\. O\. Jackson, and Q\. Mei \(2025b\)Be\.fm: open foundation models for human behavior\.CoRRabs/2505\.23058\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.23058),[Document](https://dx.doi.org/10.48550/ARXIV.2505.23058),2505\.23058Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§1](https://arxiv.org/html/2606.24162#S1.p4.9),[§1](https://arxiv.org/html/2606.24162#S1.p5.1),[§2\.1](https://arxiv.org/html/2606.24162#S2.SS1.p1.1),[§4](https://arxiv.org/html/2606.24162#S4.p4.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p2.1)\.
- Y\. Xie, Q\. Mei, W\. Yuan, and M\. O\. Jackson \(2025c\)Using large language models to categorize strategic situations and decipher motivations behind human behaviors\.Proceedings of the National Academy of Sciences122\(35\),pp\. e2512075122\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2512075122)Cited by:[§5\.6](https://arxiv.org/html/2606.24162#S5.SS6.p2.1)\.
- K\. Yang and F\. Menczer \(2023\)Large language models can rate news outlet credibility\.CoRRabs/2304\.00228\.External Links:[Link](https://doi.org/10.48550/arXiv.2304.00228),[Document](https://dx.doi.org/10.48550/ARXIV.2304.00228),2304\.00228Cited by:[§1](https://arxiv.org/html/2606.24162#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.24162#S6.SS1.p1.1)\.
- Y\. Yang and P\. Zhai \(2022\)Click\-through rate prediction in online advertising: a literature review\.Information Processing & Management59\(2\),pp\. 102853\.Cited by:[§2\.2\.3](https://arxiv.org/html/2606.24162#S2.SS2.SSS3.p1.2)\.
- X\. Zhang, Y\. Xie, J\. Huang, J\. Ma, Z\. Pan, Q\. Liu, Z\. Xiong, T\. Ergen, D\. Shim, H\. Lee, and Q\. Mei \(2025\)MASSW: A new dataset and benchmark tasks for ai\-assisted scientific workflows\.InFindings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Findings of ACL,pp\. 2373–2394\.External Links:[Link](https://doi.org/10.18653/v1/2025.findings-naacl.127),[Document](https://dx.doi.org/10.18653/V1/2025.FINDINGS-NAACL.127)Cited by:[§A\.1](https://arxiv.org/html/2606.24162#A1.SS1.SSS0.Px3.p1.1),[item 3](https://arxiv.org/html/2606.24162#S2.I1.i3.p1.1),[4th item](https://arxiv.org/html/2606.24162#S2.I2.i4.p1.1),[§2\.2\.4](https://arxiv.org/html/2606.24162#S2.SS2.SSS4.p2.1)\.
- Y\. Zhao, J\. Huang, J\. Hu, X\. Wang, Y\. Mao, D\. Zhang, Z\. Jiang, Z\. Wu, B\. Ai, A\. Wang, W\. Zhou, and Y\. Chen \(2024\)SWIFT: a scalable lightweight infrastructure for fine\-tuning\.External Links:2408\.05517,[Link](https://arxiv.org/abs/2408.05517)Cited by:[§3\.1](https://arxiv.org/html/2606.24162#S3.SS1.p1.3)\.
## Appendix A Data Details
### A\.1 Data Sources
This subsection provides a more detailed walkthrough of the data sources introduced in Section [2\.1](https://arxiv.org/html/2606.24162#S2.SS1)\.
##### Experimental data\.
Experimental data document the behavioral patterns of varied participant groups under controlled conditions, providing crucial insights into how individuals and groups make decisions and engage socially\. We use a collection of experiments from MobLab, a platform for behavioral science and economic games\. After cleaning, this source contains 526,989 behavioral observations from 91,860 unique subjects collected over a nine\-year period \(2015 to 2023\) \[Mei et al\., [2024](https://arxiv.org/html/2606.24162#bib.bib26)\]\. The collection covers seven classic economic games across nine scenarios: Dictator, Ultimatum \(Proposer and Responder\), Trust \(Investor and Banker\), Public Goods, Bomb Risk, Beauty Contest, and Push/Pull \(Prisoner’s Dilemma\)\.
##### Survey data\.
Survey data offer self\-reported information about personality traits, attitudes, and preferences\. They primarily characterize subject attributes \($x$ in Equation [1](https://arxiv.org/html/2606.24162#S1.E1)\), though they may also contain self\-reported behaviors $\tilde{y}$, which serve as a proxy for observed behaviors $y$ and can be leveraged to simulate responses to surveys not yet conducted\. Such responses may be elicited either without any specified context or under particular pre\-defined scenarios $c$ embedded in the survey items\. We adopt the publicly available Big Five Personality Test dataset \([https://www\.kaggle\.com/datasets/lucasgreenwell/ocean\-five\-factor\-personality\-test\-responses](https://www.kaggle.com/datasets/lucasgreenwell/ocean-five-factor-personality-test-responses)\), which contains demographic attributes \(race, age, and gender\) of participants along with their single\-choice responses to 50 items spanning five personality dimensions: openness, conscientiousness, extraversion, agreeableness, and neuroticism\. The dataset comprises 19,630 subjects and a total of 981,500 survey responses\.
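As a quick consistency check on the reported sizes, and assuming every retained subject answered all 50 items \(an assumption that the totals themselves support\), the response count follows directly from the subject count:

$$19{,}630 \times 50 = 981{,}500$$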
##### Literature data\.
The behavioral science literature captures established theories and empirical findings on human behavior, serving as a proxy for the behavioral knowledge $\mathbb{K}$ in Equation [1](https://arxiv.org/html/2606.24162#S1.E1)\. We gather titles and abstracts from two leading interdisciplinary venues for behavioral research: the *American Economic Review* \(AER\) and *Nature Human Behaviour* \(NHB\)\. The resulting corpus consists of 6,515 AER articles spanning 1911 to 2025 and 871 NHB articles spanning 2016 to 2025\. Following the approach of MASSW \[Zhang et al\., [2025](https://arxiv.org/html/2606.24162#bib.bib27)\], we use GPT\-5 \[OpenAI, [2025b](https://arxiv.org/html/2606.24162#bib.bib56)\] to summarize each title–abstract pair into a structured research workflow capturing the research context, key idea, method, outcome, and anticipated impact\.
##### Economics contest data\.
We also use multiple\-choice questions from the International Economics Olympiad \(IEO\) as an evaluation\-only source for economics problem solving\. This source differs from the experimental and survey datasets because it does not record human behavioral variation\. Instead, it evaluates whether models can represent, reason about, and apply domain\-specific economic knowledge to unseen problems\.
### A\.2 Data Splits
This subsection describes how each raw source from Section [A\.1](https://arxiv.org/html/2606.24162#A1.SS1) is partitioned into the Be\.FM\-1\.5 fine\-tuning set and the BehaviorBench evaluation set, and how we sample task instances from each partition\. Table [5](https://arxiv.org/html/2606.24162#A1.T5) summarizes the per\-task entry counts and the split rule used\. Across all sources, the partition is constructed so that no subject \(or paper\) appears on both the fine\-tuning and evaluation side of any task\.
##### MobLab economic games\.
We split participants \(each identified by a unique `UserID`\) 70/10/20 into train, validation, and test sets, so that a participant’s complete history of game play remains within a single partition\. The Game Behav\. Sim\., Multi\-Round Pred\., Across\-Ctx Pred\., and Strategic Game Play tasks are all generated from these splits\. Raw game logs are first cleaned by a filtering pipeline that removes invalid game choices\.
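The participant\-level split described above can be sketched as follows\. This is a minimal illustration, not the authors’ pipeline: the function name `split_subjects`, the seed, the integer rounding, and the toy log are assumptions; only the 70/10/20 ratio and the `UserID` key come from the text\.

```python
import random

def split_subjects(user_ids, percents=(70, 10, 20), seed=0):
    """Assign every unique subject ID to exactly one partition."""
    unique_ids = sorted(set(user_ids))       # deduplicate and fix a stable order
    random.Random(seed).shuffle(unique_ids)  # reproducible shuffle (seed is an assumption)
    n = len(unique_ids)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),  # remainder goes to test
    }

# Hypothetical cleaned game log: 10 subjects with 3 observations each.
observations = [{"UserID": f"u{i % 10}", "round": i // 10} for i in range(30)]
splits = split_subjects(obs["UserID"] for obs in observations)

# Route each observation by its subject, so a subject's full history stays together.
partitions = {
    name: [obs for obs in observations if obs["UserID"] in ids]
    for name, ids in splits.items()
}
```

Splitting on subject IDs rather than on individual observations is what keeps a subject’s history from leaking across partitions\.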
##### The Push/Pull game \(held\-out\)\.
The Push/Pull game is excluded from fine\-tuning entirely\. When constructing the Push/Pull evaluation, we additionally drop any subject that appears in the MobLab train or validation split, so no participant the model has seen during fine\-tuning contributes to this held\-out evaluation\.
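The additional subject filter for the held\-out game amounts to a set\-membership check\. The sketch below assumes records are dictionaries keyed by `UserID`; the helper name `drop_seen_subjects` and the sample records are hypothetical\.

```python
def drop_seen_subjects(records, train_ids, val_ids):
    """Keep only records whose subject is in neither the train nor the validation split."""
    seen = set(train_ids) | set(val_ids)
    return [r for r in records if r["UserID"] not in seen]

# Hypothetical Push/Pull records and split membership.
push_pull = [
    {"UserID": "u1", "choice": "push"},
    {"UserID": "u2", "choice": "pull"},
    {"UserID": "u3", "choice": "push"},
]
held_out = drop_seen_subjects(push_pull, train_ids={"u1"}, val_ids={"u3"})
```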
##### Big Five survey\.
We use a 90/10 split over subjects\. All Big Five tasks \(Demo\. to Resp\., Seq\. Resp\. Pred\., Masked Resp\. Pred\., Demo\. to Pers\., Across\-Dim Pers\. Pred\., and Pers\. to Demo\.\) draw their fine\-tuning and evaluation instances from this single participant\-level split\.
##### Scientific workflows\.
We split AER and NHB papers by publication year\. Papers published in 2025 form the workflow\-prediction test set, and all earlier years form the fine\-tuning set\. This temporal cut reflects the practical use case of generalizing to newly published papers\.
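A temporal cut of this kind can be expressed in a few lines\. The sketch assumes each paper is a dictionary with a `year` field; the function name `temporal_split` and the sample papers are hypothetical, and only the 2025 cutoff comes from the text\.

```python
def temporal_split(papers, test_year=2025):
    """Papers from test_year form the test set; all earlier years form the fine-tuning set."""
    fine_tune = [p for p in papers if p["year"] < test_year]
    test = [p for p in papers if p["year"] == test_year]
    return fine_tune, test

# Hypothetical paper records.
papers = [
    {"venue": "AER", "year": 1911},
    {"venue": "NHB", "year": 2016},
    {"venue": "AER", "year": 2024},
    {"venue": "NHB", "year": 2025},
]
fine_tune, test = temporal_split(papers)
```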
##### Economics contest \(IEO\)\.
The IEO multiple\-choice problems serve as an evaluation\-only source and have no corresponding fine\-tuning data\.
##### Sampling and downsampling\.
For tasks where the natural enumeration produces far more instances than we need for either fine\-tuning or evaluation, we downsample to a fixed target size\. All Big Five evaluation tasks are capped at 1K examples\. MobLab game tasks use random sampling per game, with a cap of 200, 500, and 750 examples per game for Game Behav\. Sim\., Multi\-Round Pred\., and Across\-Ctx Pred\., respectively\. The Strategic Game Play evaluation set is capped at 1K examples\. The final per\-task counts after sampling are reported in Table [5](https://arxiv.org/html/2606.24162#A1.T5)\.
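The per\-game capping described above can be sketched as below\. The cap values come from the text; the function name `cap_per_game`, the seed, the record layout, and the choice to keep every instance when a game has fewer than the cap are assumptions\.

```python
import random
from collections import defaultdict

# Per-game evaluation caps stated in the text.
PER_GAME_CAP = {
    "Game Behav. Sim.": 200,
    "Multi-Round Pred.": 500,
    "Across-Ctx Pred.": 750,
}

def cap_per_game(instances, cap, seed=0):
    """Randomly sample at most `cap` instances from each game."""
    by_game = defaultdict(list)
    for inst in instances:
        by_game[inst["game"]].append(inst)
    rng = random.Random(seed)  # fixed seed for reproducibility (an assumption)
    sampled = []
    for game in sorted(by_game):
        pool = by_game[game]
        # Assumption: games with fewer instances than the cap are kept in full.
        sampled.extend(rng.sample(pool, min(cap, len(pool))))
    return sampled

# Hypothetical enumeration: one large game and one small game.
instances = (
    [{"game": "Dictator", "id": i} for i in range(1000)]
    + [{"game": "Bomb Risk", "id": i} for i in range(150)]
)
eval_set = cap_per_game(instances, PER_GAME_CAP["Game Behav. Sim."])
```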

| Capability | Task | # Train | # Eval | Train/Test Split |
|---|---|---|---|---|
| Behavior prediction and simulation | Game Behav. Sim. | 78.8K | 200/game | by subject |
| | Multi-Round Pred. | 57.9K | 500/game | by subject |
| | Across-Ctx Pred. | 44.1K | 750/game | by subject |
| | Demo. To Resp. | 88.3K | 1,000 | by subject |
| | Seq. Resp. Pred. | 36.0K | 1,000 | by subject |
| | Masked Resp. Pred. | 40.0K | 1,000 | by subject |
| Subject-trait inference | Demo. To Pers. | 88.3K | 1,000 | by subject |
| | Across-Dim Pers. Pred. | 88.3K | 1,000 | by subject |
| | Pers. To Demo. | 17.7K | 1,000 | by subject |
| Strategic decision-making | Strategic Game Play | 85.2K | 1,000 | by subject |
| Behavioral knowledge application | Sci. Workflow Pred. | 24.7K | 1,084 | by publication year |
| | Economics Contest | – | 124 | eval only |

Table 5: Training and evaluation data summary for Be\.FM\-1\.5\.
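The per-game caps described under "Sampling and downsampling" can be sketched as below. This is an illustrative helper, not the released code: the function name, the `game` key, and the seed are assumptions, and the toy instance counts are made up.

```python
import random
from collections import defaultdict


def cap_per_group(instances, cap, key="game", seed=0):
    """Randomly keep at most `cap` instances for each value of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for inst in instances:
        groups[inst[key]].append(inst)
    sampled = []
    for name in sorted(groups):
        members = groups[name]
        if len(members) > cap:
            members = rng.sample(members, cap)  # sample without replacement
        sampled.extend(members)
    return sampled


# Toy pool: one game exceeds the cap, the other does not.
instances = [{"game": "dictator", "id": i} for i in range(300)]
instances += [{"game": "bomb", "id": i} for i in range(50)]
sampled = cap_per_group(instances, cap=200)
```

Games with fewer instances than the cap are kept whole, so the evaluation size per game is an upper bound rather than an exact count.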
### A\.3 Evaluation Tasks in BehaviorBench
For each evaluation task in BehaviorBench, we briefly describe the task below and provide its prompt template\.
##### Single\-round behavior simulation \(Game Behav\. Sim\.\)\.
The model is shown the rules of a single MobLab economic game and asked for its first\-round move\. The game\-specific instructions cover the Dictator game, the Ultimatum game \(as Proposer or Responder\), the Trust game \(as Investor or Banker\), the Public Goods game, the Bomb Risk game, the Beauty Contest game, and the Push/Pull game \(Prisoner’s Dilemma\) \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7)\)\.
##### Multi\-round behavior prediction \(Multi\-Round Pred\.\)\.
The model is shown the game instruction together with the subject’s prior rounds in the same session, and predicts the next round’s move \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx1.SSS0.Px10)\)\.
##### Across\-context single\-round prediction \(Across\-Ctx Pred\.\)\.
The model is shown a subject’s first\-round moves in several other games and asked to predict the subject’s first\-round move in a target game \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx2.SSS0.Px5)\)\.
##### Survey response prediction given demographics \(Demo\. to Resp\.\)\.
The model is given a subject’s demographics and asked to predict one Big Five item of this subject on a five\-point Likert scale \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4.SSS0.Px1)\)\.
##### Sequential survey response prediction \(Seq\. Resp\. Pred\.\)\.
The model is shown the subject’s previous answers in the original survey order \(interleaved across the five dimensions\) and predicts the subject’s response to the next item \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4.SSS0.Px3)\)\.
##### Masked survey response prediction \(Masked Resp\. Pred\.\)\.
The model is shown 49 of a subject’s 50 Big Five item responses, grouped by dimension, and predicts the held\-out item \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4.SSS0.Px4)\)\.
##### Personality score from demographics \(Demo\. to Pers\.\)\.
The model is given a subject’s demographics and predicts the subject’s score on one Big Five dimension \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4)\)\.
##### Across\-dimension personality prediction \(Across\-Dim Pers\. Pred\.\)\.
The model is given a subject’s scores on four Big Five dimensions and predicts the score on the held\-out dimension \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4.SSS0.Px5)\)\.
##### Age prediction from personality \(Pers\. to Demo\.\)\.
The model is given a subject’s scores on all five Big Five dimensions and predicts the subject’s age \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4.SSS0.Px2)\)\.
##### Strategic game play\.
The model plays the Beauty Contest game against human players\. Given the play history and the group average from each prior round, the model picks an integer in $[0,100]$ that aims to be closest to two\-thirds of the next group average \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx1.SSS0.Px8)\)\.
##### Scientific workflow prediction \(Sci\. Workflow Pred\.\)\.
The task comprises five generation sub\-tasks \(idea generation, method recommendation, outcome prediction, impact prediction, and title prediction\), each predicting one workflow aspect from the preceding aspects of the same paper \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx4.SSS0.Px6)\)\.
##### Economics contest \(IEO\)\.
The model selects A/B/C/D for each multiple\-choice problem from the International Economics Olympiad \([\[PROMPT\]](https://arxiv.org/html/2606.24162#A7.SSx5.SSS0.Px2)\)\.
### A\.4 Fine\-tuning Data Recipe for Be\.FM\-1\.5
We instantiate eleven fine\-tuning tasks that span the three data sources of Section [A\.1](https://arxiv.org/html/2606.24162#A1.SS1) and cover the four capability axes of BehaviorBench\. All instances are drawn from the fine\-tuning partitions defined in Section [A\.2](https://arxiv.org/html/2606.24162#A1.SS2), so no subject or paper that appears in the BehaviorBench evaluation set ever appears in fine\-tuning\. The final fine\-tuning mix contains 826,064 examples\. Per\-task counts are listed in Table [5](https://arxiv.org/html/2606.24162#A1.T5)\. We describe the tasks below grouped by the BehaviorBench capability they target\.
##### Behavior prediction and simulation\.
Six tasks\. Three on MobLab games: single\-round behavior simulation, multi\-round behavior prediction, and across\-context single\-round prediction\. Three on the Big Five survey: response prediction from demographics, sequential response prediction in survey order, and masked response prediction over the full 50\-item survey\.
##### Subject\-trait inference\.
Three Big Five tasks: dimension\-score prediction from demographics, leave\-one\-dimension prediction across the five dimensions, and age prediction from the five dimension scores\. The dimension\-score\-from\-demographics task is upsampled by a factor of three in the final mix \(three identical copies are shuffled together with the rest\)\. This is because preliminary experiments show that this task requires more training to improve on the validation set\.
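As a sanity check on how the upsampling enters the final mix, the sketch below sums the rounded per-task counts from Table 5. It assumes (this is not stated explicitly in the text) that the table reports counts before upsampling; under that assumption the total lands close to the stated 826,064 examples, with the small remainder attributable to rounding to 0.1K.

```python
# Per-task fine-tuning counts in thousands, as listed in Table 5.
counts_k = {
    "Game Behav. Sim.": 78.8,
    "Multi-Round Pred.": 57.9,
    "Across-Ctx Pred.": 44.1,
    "Demo. To Resp.": 88.3,
    "Seq. Resp. Pred.": 36.0,
    "Masked Resp. Pred.": 40.0,
    "Demo. To Pers.": 88.3,
    "Across-Dim Pers. Pred.": 88.3,
    "Pers. To Demo.": 17.7,
    "Strategic Game Play": 85.2,
    "Sci. Workflow Pred.": 24.7,
}

# Assumption: only the dimension-score-from-demographics task is replicated (3 copies).
upsample = {"Demo. To Pers.": 3}

total_k = sum(n * upsample.get(task, 1) for task, n in counts_k.items())
```

Here `total_k` comes to roughly 825.9, that is, about 825.9K examples.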
##### Strategic decision\-making\.
One task: strategic Beauty Contest play\. The Beauty Contest fine\-tuning data uses a multi\-winner expansion: when several integers tie as winners against the recorded other players in a round, each winning integer becomes a separate fine\-tuning example \(capped at 10 per round\)\.
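The multi-winner expansion can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: the candidate guess is included in the group average, a candidate "wins" when it is at least as close to two-thirds of that average as every recorded player, and the cap keeps the candidates closest to the target.

```python
from fractions import Fraction


def winning_integers(others, cap=10, low=0, high=100):
    """Return up to `cap` integers that would win against the recorded guesses."""
    winners = []
    for g in range(low, high + 1):
        # Target is 2/3 of the average over the recorded players plus the candidate.
        target = Fraction(2, 3) * Fraction(sum(others) + g, len(others) + 1)
        my_gap = abs(g - target)
        best_other_gap = min(abs(o - target) for o in others)
        if my_gap <= best_other_gap:  # ties count as wins (assumption)
            winners.append((my_gap, g))
    winners.sort()  # closest to the target first (assumed tie-break for the cap)
    return sorted(g for _, g in winners[:cap])


many = winning_integers([10, 90])       # many winning integers, capped at 10
single = winning_integers([30, 50, 70])  # only one integer can win here
```

Each returned integer would become one fine-tuning example for that round.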
##### Behavioral knowledge application\.
One task with five sub\-tasks \(idea generation, method recommendation, outcome prediction, impact prediction, and title prediction\), each predicting one workflow aspect from the preceding aspects\.
## Appendix B Metrics and Evaluation Details
##### Wasserstein normalization and aggregation\.
In tasks constructed from economic games, players’ choice ranges differ across games \(e\.g\., in the Dictator game, a player can choose a number between 0 and 100, but in the Public Goods game, a player can choose between 0 and 20\)\. We normalize the players’ choices into the range $[0,100]$ and compute the $W$ score for every game\. We then average these $W$ scores across all games\. A lower $W$ score indicates that the predicted distribution is closer to the observed human behavior distribution\.
For tasks constructed from survey data, the responses in the Big Five Personality Test are restricted to integers 1 to 5, so there is no need for normalization\. For the Demo\. To Resp\., Seq\. Resp\. Pred\., and Masked Resp\. Pred\. tasks, we treat each survey item as a separate prediction task and report the average $W$ score over them\. For example, in the Demo\. To Resp\. task, there are 50 questions in the evaluation set, and we compute the $W$ score between the model\-predicted distribution and the human distribution for each question\. Finally, we average the $W$ scores across the 50 questions\.
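The normalization and averaging steps can be sketched in plain Python. The sketch assumes the 1-Wasserstein distance between empirical distributions, computed as the area between the two empirical CDFs; the function names and the toy data are hypothetical.

```python
def wasserstein_1d(u, v):
    """1-Wasserstein distance between two empirical samples (area between CDFs)."""
    points = sorted(set(u) | set(v))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = sum(x <= left for x in u) / len(u)
        cdf_v = sum(x <= left for x in v) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def normalize(choices, low, high):
    """Rescale choices from [low, high] to [0, 100]."""
    return [100.0 * (c - low) / (high - low) for c in choices]


def mean_w(games):
    """Average the per-game W scores after normalizing each game's choices."""
    scores = []
    for g in games.values():
        pred = normalize(g["pred"], g["low"], g["high"])
        human = normalize(g["human"], g["low"], g["high"])
        scores.append(wasserstein_1d(pred, human))
    return sum(scores) / len(scores)


# Toy example: the same relative mismatch in two games with different ranges.
games = {
    "dictator": {"pred": [50, 50], "human": [0, 100], "low": 0, "high": 100},
    "public_goods": {"pred": [10, 10], "human": [0, 20], "low": 0, "high": 20},
}
avg_w = mean_w(games)
```

After normalization both toy games contribute the same score, which is the point of rescaling: without it, games with wider choice ranges would dominate the average.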
##### Evaluation details\.
All models are evaluated five times on IEO due to its small size, and we report the mean and standard deviation across the five runs\. The two Be\.FM\-1\.5 models \(Be\.FM\-1\.5\-70B and Be\.FM\-1\.5\-4B\) are also evaluated five times on every other BehaviorBench task, with the mean reported\.
## Appendix C Be\.FM\-1\.5 Can Generate Insights about Contextual Factors
Human behavior is influenced by a wide range of contextual factors, including incentive mechanisms, social visibility, framing conditions, and demographic or interpersonal characteristics\. Identifying the contextual drivers underlying observed behaviors is inherently difficult, as similar behavioral outcomes can emerge from distinct experimental settings or interventions\. Formally, this can be represented as a projection
$$\mathcal{X}\times\mathcal{Y}\times\mathbb{K}\to\mathcal{C},\tag{5}$$
where an input $(x,g(y),k)\in\mathcal{X}\times\mathcal{Y}\times\mathbb{K}$ is mapped to an output $c\in\mathcal{C}$\. Here $x$ denotes a subject’s traits, $g(y)$ represents a behavioral objective or observed behavioral pattern derived from behavior $y$, $k$ denotes prior behavioral knowledge, and $c$ corresponds to a contextual variable\.
Evaluating a model’s contextual factor inference capability is challenging because there is no standard dataset, and relevant information is scattered across the behavioral science literature\. Here we use the Dictator Game meta\-study by Engel \[[2011](https://arxiv.org/html/2606.24162#bib.bib57)\] to conduct a qualitative analysis of Be\.FM’s contextual inference capability\. The study organizes interventions into six categories: incentives, social control, distributive concerns, framing, social distance, and demographic factors\.
Given a target behavioral objective such as increasing generosity, Be\.FM generates interventions aligned with the six intervention categories, including:
- Incentive manipulations, such as constrained action spaces \[Bolton et al\., [1998](https://arxiv.org/html/2606.24162#bib.bib58)\], payoff uncertainty \[Andreoni and Bernheim, [2009](https://arxiv.org/html/2606.24162#bib.bib59)\], varying stake sizes \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\], and repeated interactions \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\]\.
- Social\-control manipulations, such as social\-observability cues \[Rigdon et al\., [2009](https://arxiv.org/html/2606.24162#bib.bib60)\] and concealment mechanisms \[Dana et al\., [2007](https://arxiv.org/html/2606.24162#bib.bib61)\]\.
- Distributive\-concern manipulations, such as recipient earned entitlement \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\] and payoff multipliers \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\]\.
- Framing manipulations, such as market framing \[Hoffman et al\., [1994](https://arxiv.org/html/2606.24162#bib.bib62)\], recipient\-dependence cues \[Brañas\-Garza, [2007](https://arxiv.org/html/2606.24162#bib.bib63)\], and real\-money interactions \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\]\.
- Social\-distance manipulations, such as interpersonal proximity \[Leider et al\., [2009](https://arxiv.org/html/2606.24162#bib.bib64)\] and indirect social connectedness \[Goeree et al\., [2010](https://arxiv.org/html/2606.24162#bib.bib65)\]\.
- Demographic and cultural manipulations, including gender differences \[Eckel and Grossman, [1998](https://arxiv.org/html/2606.24162#bib.bib66)\], age variation \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\], and cross\-cultural sharing norms \[Henrich et al\., [2005](https://arxiv.org/html/2606.24162#bib.bib67)\]\.
As shown in Table [6](https://arxiv.org/html/2606.24162#A3.T6), Be\.FM covers most of the interventions documented in the Engel \[[2011](https://arxiv.org/html/2606.24162#bib.bib57)\] meta\-study across all six categories\. The model also proposes plausible interventions beyond those summarized in the meta\-analysis, such as time pressure, cognitive load, and probabilistic recipient assignment, suggesting that Be\.FM can generalize beyond observed experimental configurations and support hypothesis generation for future behavioral studies\.
Table 6: Generated interventions by Be\.FM\-1\.5\-4B to increase and decrease sharing behavior in the Dictator Game\. Be\.FM\-1\.5\-4B correctly predicts most of the interventions in a comprehensive meta\-study on the Dictator Game \[Engel, [2011](https://arxiv.org/html/2606.24162#bib.bib57)\]\. Prompts are provided [here](https://arxiv.org/html/2606.24162#A7.SSx6)\.

| Category | Increase Sharing | Decrease Sharing |
|---|---|---|
| Incentive | “Participants receive a bonus when the amount shared is higher.”<br>“Dictators are required to give at least 10% of the endowment.”<br>“Dictators are required to give at least 50% of the endowment.”<br>“Participants face constrained allocation choices instead of a fully free split.” | “Participants are allowed to keep any amount without allocation restrictions.”<br>“Dictators must pay a cost in order to transfer money to the recipient.”<br>“Participants receive larger endowments that increase incentives for self-retention.”<br>“The allocation rule emphasizes keeping a larger personal share of the endowment.” |
| Social Control | “Participants are told that their decisions may be reviewed by a social psychologist.”<br>“Recipients can write a positive review about the dictator after observing the decision.”<br>“The dictator’s allocation decision is revealed to the group.” | “Allocation decisions are made in a virtual environment with reduced social presence.”<br>“The recipient does not directly observe the dictator’s allocation decision.”<br>“Participants make decisions anonymously without social feedback.”<br>“The dictator’s choice is hidden before the other player is informed.” |
| Distributive Concerns | “Recipients are able to reciprocate after receiving the allocation.”<br>“Recipients are explicitly told they can keep the received amount.”<br>“Dictators are allowed to choose the recipient of the allocation.”<br>“Recipients are expected to repay part of the received amount later.” | “Recipients are required to repay part of the received amount later.”<br>“The allocation is framed as belonging primarily to the dictator.”<br>“The recipient’s payoff depends on the dictator’s retained amount.” |
| Framing | “The received amount can be used to purchase a good for the recipient.”<br>“The dictator is paid based on the value of the purchased good.” | “The allocation task is framed using self-interested or ownership-oriented language.”<br>“Participants are encouraged to focus on maximizing their personal payoff.” |
| Social Distance | “The recipient is a previous dictator from the same group.”<br>“The recipient is described as a close friend of the dictator.” | “The recipient is described as a stranger with no social connection to the dictator.”<br>“Participants are told they are unlikely to interact with the recipient again.” |
| Demographic and Cultural | – | “The experiment is conducted in a setting where sharing is not a strong social norm.”<br>“Recipients are described as financially independent and already employed.” |

## Appendix D Analysis of Be\.FM\-1\.5\-4B’s Regression on IEO Task
As discussed in Section [5\.4](https://arxiv.org/html/2606.24162#S5.SS4), after fine\-tuning, Be\.FM\-1\.5\-4B regresses on the IEO economics contest from 73\.9% \(its backbone, Qwen3\-4B\) to 53\.3%\. One possible reason is a loss of reasoning capability on questions that require step\-by\-step derivation\. To test this hypothesis, we split the 124 IEO questions into two sets, reasoning items and non\-reasoning items, by whether Qwen3\-4B’s response includes a reasoning trace with more than 10 tokens \(Table [7](https://arxiv.org/html/2606.24162#A4.T7)\)\. The two models are within 6 points on non\-reasoning items \(70\.4% vs\. 64\.6%\) but 53 points apart on reasoning items \(81\.0% vs\. 27\.8%\)\. Aggregated across five evaluation runs, Qwen3\-4B outputs 447 tokens per answer on average, while Be\.FM\-1\.5\-4B averages 1\.3 tokens and outputs only the final letter on every item\. Supervised fine\-tuning on terse behavioral targets has compressed the 4B output format and suppressed the multi\-step computation its backbone uses for numerical questions\. Closing the gap at 4B will require training that preserves chain\-of\-thought where it is needed, which we leave to future versions of Be\.FM\-1\.5\.
Table 7: IEO questions split by whether Qwen3\-4B’s response contains a reasoning trace \(\>10 tokens\)\. The Be\.FM\-1\.5\-4B regression concentrates on questions that require multi\-step reasoning\.

| Question Set | N | Qwen3-4B | Be.FM-1.5-4B | Gap |
|---|---|---|---|---|
| All | 124 | 73.9% | 53.3% | −20.6% |
| Reasoning (>10 tok) | 41 | 81.0% | 27.8% | −53.2% |
| Non-reasoning | 83 | 70.4% | 64.6% | −5.8% |

##### Case study: wage calculation\.
We present an IEO question that Qwen3\-4B answers correctly but Be\.FM\-1\.5\-4B does not\. We show the question and model outputs below \(Boxes [D](https://arxiv.org/html/2606.24162#A4.SS0.SSS0.Px1)–[D](https://arxiv.org/html/2606.24162#A4.SS0.SSS0.Px1)\)\. Qwen3\-4B outputs a long reasoning trace \(485 tokens\) and correctly solves this question with a step\-by\-step calculation, while Be\.FM\-1\.5\-4B directly outputs a wrong prediction\.
System\. You are an expert in economics\. Answer the following multiple\-choice question by selecting the correct option\(s\)\. There may be one or more correct answers\. Output the letter\(s\) of the correct answer\(s\) \(e\.g\., A or BC\)\. Provide only the letter\(s\) of your answer\.

User\. Currently you work for 40 hours per week at the wage rate of $20 an hour\. Your free hours are defined as the number of hours not spent in work per week, which in this case is 24 hours × 7 days − 40 hours = 128 hours per week\. Suppose now that your wage rate has increased by 25%\. If you are happy to keep your total weekly income constant, then:

- A\. Your total number of working hours per week will fall by 25%\.
- B\. Your total number of working hours per week will be 30 hours\.
- C\. Your total number of free hours per week will increase by 25%\.
- D\. Your total number of free hours per week will increase by 6\.25%\.

Ground truth: D\.

Qwen3\-4B output\. We are given:

- Current weekly working hours: 40 hours
- Current wage rate: $20/hour
- Free hours = \(24 hours/day × 7 days\) − 40 = 168 − 40 = 128 hours per week
- Wage increases by 25%, so new wage rate = $20 × 1\.25 = $25/hour
- You want to keep total weekly income constant

Step 1: Calculate current weekly income\. Current income = 40 hours × $20 = $800

Step 2: After wage increase, keep income at $800\. New wage = $25/hour\. So, new working hours to maintain $800 income: New hours = 800 / 25 = 32 hours per week

Step 3: Compare working hours\. Original: 40 hours\. New: 32 hours\. So, working hours decrease by: \(40 − 32\) / 40 = 8 / 40 = 0\.20 = 20%\. So, A is false: it says working hours fall by 25%, but it’s only 20%\.

Step 4: Check option B\. B: “Your total number of working hours per week will be 30 hours\.” We calculated it as 32 hours → False\.

Step 5: Free hours\. Original free hours = 128 hours\. New working hours = 32 → new free hours = 168 − 32 = 136 hours\. Increase in free hours = 136 − 128 = 8 hours\. Percentage increase: 8 / 128 = 0\.0625 = 6\.25%\. So, D is correct: Free hours increase by 6\.25%\. C says: “Free hours increase by 25%” → false\.

Final Answer: Only D is correct\. D

Be\.FM\-1\.5\-4B output\. B
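The arithmetic behind the ground-truth answer can be checked directly. The sketch below uses exact fractions and only the numbers given in the question; the variable names are illustrative.

```python
from fractions import Fraction

hours, wage = 40, 20
week_hours = 24 * 7                 # 168 hours in a week
income = hours * wage               # 800 dollars per week
new_wage = wage * Fraction(5, 4)    # 25% raise -> 25 dollars per hour
new_hours = income / new_wage       # hours needed to keep income constant

free_before = week_hours - hours
free_after = week_hours - new_hours

work_change = (new_hours - hours) / hours
free_change = (free_after - free_before) / free_before
```

Working hours fall to 32 (a 20% drop, so options A and B are false), and free hours rise from 128 to 136, a 6.25% increase, which matches option D.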
## Appendix E Per\-game Results for Behavior Prediction and Simulation
This appendix reports the per\-game results for the three economic\-game tasks summarized in the main results table \(Table [3](https://arxiv.org/html/2606.24162#S5.T3)\): single\-round game behavior simulation, multi\-round behavior prediction, and across\-game single\-round prediction\. For each task we report distributional alignment with Wasserstein distance \($W$, lower is better\) and individual\-level prediction with mean absolute error \(MAE, lower is better\)\. The single\-round simulation results appear in Table [8](https://arxiv.org/html/2606.24162#A5.T8), the multi\-round prediction results in Tables [9](https://arxiv.org/html/2606.24162#A5.T9) and [10](https://arxiv.org/html/2606.24162#A5.T10), and the across\-game prediction results in Tables [11](https://arxiv.org/html/2606.24162#A5.T11) and [12](https://arxiv.org/html/2606.24162#A5.T12)\.
### E\.1 Economic Games

| Model | Dictator | Proposer | Responder | Investor | Banker | Pub. Goods | Bomb | Beauty Contest | Push/Pull | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 21.3 | 41.0 | 33.6 | 29.0 | 23.2 | 22.8 | 18.9 | 33.9 | 18.2 | 26.9 |
| Llama3.3-70B | 20.1 | 14.8 | 33.4 | 21.3 | 30.2 | 18.8 | 18.6 | 15.3 | 45.3 | 24.2 |
| DeepSeek V3.2 | 13.7 | 8.0 | 7.6 | 27.1 | 18.9 | 22.9 | 17.7 | 18.0 | 18.0 | 16.9 |
| Claude Haiku 4.5 | 18.7 | 6.9 | 28.4 | 25.2 | 29.4 | 22.9 | 17.7 | 26.6 | 44.5 | 24.5 |
| Claude Sonnet 4.6 | 25.8 | 13.6 | 18.7 | 30.8 | 19.1 | 19.5 | 17.7 | 19.5 | 57.0 | 24.6 |
| Claude Opus 4.6 | 25.8 | 13.3 | 23.8 | 33.3 | 19.2 | 23.6 | 17.7 | 20.4 | 57.0 | 26.0 |
| GPT 5.4 mini | 25.8 | 9.5 | 30.5 | 22.0 | 36.9 | 42.0 | 17.7 | 34.9 | 42.0 | 29.0 |
| GPT 4.1 | 25.8 | 16.5 | 32.4 | 47.1 | 19.6 | 21.3 | 15.4 | 25.1 | 44.5 | 27.5 |
| GPT 5.4 | 25.8 | 12.3 | 32.4 | 37.9 | 37.1 | 41.8 | 17.7 | 35.5 | 42.0 | 31.4 |
| Gemini 3.1 Flash | 25.8 | 12.0 | 14.9 | 57.8 | 22.8 | 53.7 | 17.7 | 22.5 | 54.0 | 31.3 |
| Gemini 3.1 Pro | 25.8 | 11.1 | 32.4 | 25.0 | 29.5 | 16.9 | 17.7 | 18.6 | 15.0 | 21.3 |
| Socrates-Llama3-8B-SFT | 26.6 | 7.3 | 16.1 | 5.5 | 5.6 | 9.0 | 9.6 | 10.2 | 57.0† | 16.3 |
| Socrates-Llama3-8B-DPO | 25.8 | 12.0 | 18.1 | 33.6 | 22.9 | 25.1 | 17.6 | 25.3 | 57.0 | 26.4 |
| Socrates-Qwen-14B-SFT | 19.9 | 4.4 | 16.7 | 16.1 | 10.1 | 11.6 | 10.4 | 15.3 | 57.0† | 17.9 |
| Socrates-Qwen-14B-DPO | 25.8 | 19.7 | 14.7 | 37.5 | 22.9 | 24.6 | 17.7 | 14.0 | 3.6 | 20.0 |
| Centaur-70B | 26.2 | 4.8 | 5.7 | 19.3 | 29.9 | 19.0 | 5.6 | 15.7 | 26.0 | 16.9 |
| Be.FM-1-8B | 4.5 | 7.4 | 4.6 | 7.9 | 6.4 | 7.2 | 3.1 | 7.4 | 30.8 | 8.8 |
| Be.FM-1-70B | 3.5 | 6.5 | 4.9 | 7.7 | 5.3 | 7.8 | 4.3 | 7.3 | 30.8 | 8.7 |
| Be.FM-1.5-4B | 3.7 | 4.0 | 4.5 | 15.7 | 8.1 | 8.7 | 4.1 | 6.2 | 8.6 | 7.0 |
| Be.FM-1.5-70B | 7.1 | 6.9 | 5.8 | 14.1 | 4.7 | 5.3 | 7.2 | 6.1 | 23.1 | 8.9 |

Table 8: Single\-round game behavior simulation: Wasserstein distance \(↓\)\. †The model produced no parseable Push/Pull response, so its Push/Pull $W$ is set to the worst observed value in this column\.

| Model | Dictator | Investor | Banker\_inv50 | Banker\_inv100 | Pub. Goods | Bomb | Beauty Contest | Push/Pull | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 13.5 | 22.2 | 19.3 | 7.7 | 15.0 | 13.6 | 9.2 | 4.0 | 13.1 |
| Llama3.3-70B | 21.8 | 12.6 | 27.3 | 13.8 | 15.0 | 12.0 | 7.3 | 12.8 | 15.3 |
| DeepSeek V3.2 | 14.2 | 6.1 | 18.0 | 11.4 | 5.5 | 12.7 | 9.9 | 3.0 | 10.1 |
| Claude Haiku 4.5 | 13.1 | 12.3 | 27.9 | 12.7 | 9.4 | 17.0 | 7.3 | 0.0 | 12.5 |
| Claude Sonnet 4.6 | 5.9 | 8.3 | 16.9 | 9.4 | 17.5 | 17.2 | 8.7 | 0.8 | 10.6 |
| Claude Opus 4.6 | 8.5 | 9.8 | 19.3 | 9.5 | 9.8 | 17.3 | 8.6 | 10.4 | 11.7 |
| GPT 5.4 mini | 3.0 | 22.3 | 9.4 | 4.4 | 41.1 | 17.2 | 8.7 | 34.6 | 17.6 |
| GPT 4.1 | 2.5 | 2.3 | 15.1 | 8.4 | 8.4 | 5.0 | 8.1 | 5.6 | 6.9 |
| GPT 5.4 | 2.1 | 17.3 | 9.7 | 2.5 | 34.7 | 17.2 | 9.4 | 28.6 | 15.2 |
| Gemini 3.1 Flash | 8.6 | 27.2 | 19.1 | 9.1 | 7.4 | 16.1 | 10.3 | 50.8 | 18.6 |
| Gemini 3.1 Pro | 1.6 | 14.7 | 10.9 | 7.1 | 24.4 | 16.4 | 9.7 | 15.0 | 12.5 |
| Socrates-Llama3-8B-SFT | 15.6 | 15.8 | 9.1 | 29.3 | 19.6 | 11.3 | 19.0 | 50.8† | 21.3 |
| Socrates-Llama3-8B-DPO | 26.5 | 12.0 | 22.5 | 14.0 | 21.4 | 9.4 | 11.6 | 40.1 | 19.7 |
| Socrates-Qwen-14B-SFT | 18.5 | 5.5 | 6.9 | 11.5 | 12.8 | 6.8 | 14.0 | 50.8† | 15.9 |
| Socrates-Qwen-14B-DPO | 14.5 | 15.2 | 20.5 | 14.1 | 10.8 | 7.8 | 6.6 | 26.0 | 14.4 |
| Centaur-70B | 15.3 | 2.6 | 12.3 | 8.1 | 11.5 | 5.0 | 6.8 | 27.0 | 11.1 |
| Be.FM-1-8B | 5.7 | 5.8 | 3.3 | 22.6 | 4.4 | 7.8 | 12.7 | 36.6 | 12.4 |
| Be.FM-1-70B | 4.5 | 6.0 | 4.2 | 20.8 | 4.0 | 6.0 | 10.1 | 13.2 | 8.6 |
| Be.FM-1.5-4B | 1.9 | 3.3 | 9.4 | 4.1 | 4.3 | 3.1 | 3.9 | 15.5 | 5.7 |
| Be.FM-1.5-70B | 4.2 | 3.9 | 4.8 | 2.3 | 1.8 | 2.9 | 5.1 | 3.1 | 3.5 |

Table 9: Multi\-round game behavior prediction: Wasserstein distance \(↓\)\. †The model produced no parseable Push/Pull response, so its Push/Pull $W$ is set to the worst observed value in this column\.

| Model | Dictator | Investor | Banker\_inv50 | Banker\_inv100 | Pub. Goods | Bomb | Beauty Contest | Push/Pull | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 18.7 | 30.0 | 23.5 | 17.0 | 20.9 | 17.8 | 11.4 | 34.0 | 21.7 |
| Llama3.3-70B | 23.8 | 35.4 | 28.8 | 18.4 | 26.6 | 18.5 | 11.1 | 53.6 | 27.0 |
| DeepSeek V3.2 | 20.2 | 25.0 | 24.9 | 16.0 | 22.5 | 17.4 | 11.6 | 29.4 | 20.9 |
| Claude Haiku 4.5 | 16.9 | 24.8 | 29.5 | 17.7 | 20.2 | 17.3 | 10.5 | 31.2 | 21.0 |
| Claude Sonnet 4.6 | 11.8 | 22.7 | 20.7 | 13.2 | 24.5 | 17.2 | 11.0 | 28.8 | 18.7 |
| Claude Opus 4.6 | 13.0 | 22.7 | 22.8 | 13.2 | 21.7 | 17.3 | 11.0 | 32.4 | 19.3 |
| GPT 5.4 mini | 10.6 | 36.8 | 17.4 | 16.8 | 42.5 | 17.4 | 10.9 | 37.0 | 23.7 |
| GPT 4.1 | 10.1 | 21.0 | 21.3 | 15.0 | 18.3 | 16.3 | 10.6 | 32.4 | 18.1 |
| GPT 5.4 | 9.4 | 33.0 | 16.9 | 13.7 | 36.4 | 17.2 | 11.2 | 31.4 | 21.1 |
| Gemini 3.1 Flash | 14.2 | 36.3 | 24.2 | 14.1 | 26.2 | 17.9 | 11.7 | 59.2 | 25.5 |
| Gemini 3.1 Pro | 9.7 | 31.3 | 19.2 | 13.7 | 29.7 | 16.8 | 11.5 | 23.4 | 19.4 |
| Socrates-Llama3-8B-SFT | 25.9 | 37.6 | 28.1 | 36.9 | 32.1 | 28.1 | 27.7 | 100.0† | 39.5 |
| Socrates-Llama3-8B-DPO | 27.9 | 26.5 | 23.9 | 17.8 | 26.9 | 18.4 | 18.7 | 57.8 | 27.3 |
| Socrates-Qwen-14B-SFT | 27.9 | 33.0 | 26.1 | 27.7 | 26.5 | 22.5 | 23.7 | 100.0† | 35.9 |
| Socrates-Qwen-14B-DPO | 18.4 | 26.4 | 25.1 | 17.0 | 19.5 | 16.5 | 10.0 | 30.8 | 20.5 |
| Centaur-70B | 19.8 | 23.5 | 26.6 | 22.6 | 24.7 | 19.0 | 15.5 | 34.6 | 23.3 |
| Be.FM-1-8B | 24.0 | 41.8 | 27.6 | 34.1 | 35.2 | 26.7 | 24.5 | 59.0 | 34.1 |
| Be.FM-1-70B | 21.8 | 42.8 | 27.7 | 31.0 | 36.0 | 26.4 | 20.8 | 49.6 | 32.0 |
| Be.FM-1.5-4B | 11.3 | 21.5 | 20.7 | 14.3 | 18.8 | 17.1 | 11.2 | 32.2 | 18.4 |
| Be.FM-1.5-70B | 12.3 | 20.8 | 17.9 | 15.4 | 19.1 | 17.2 | 10.7 | 30.9 | 18.0 |

Table 10: Multi\-round game behavior prediction: MAE \(↓\)\. †The model produced no parseable Push/Pull response; all such cases are counted as incorrect predictions\.

| Model | Dictator | Proposer | Responder | Investor | Banker | Pub. Goods | Bomb | Beauty Contest | Push/Pull | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 16.3 | 21.7 | 28.4 | 27.1 | 19.7 | 22.0 | 17.7 | 18.3 | 9.5 | 20.1 |
| Llama3.3-70B | 14.5 | 9.5 | 10.4 | 20.7 | 12.4 | 16.7 | 16.8 | 14.0 | 18.7 | 14.9 |
| DeepSeek V3.2 | 16.2 | 5.4 | 9.8 | 11.4 | 16.6 | 11.1 | 10.0 | 17.3 | 12.9 | 12.3 |
| Claude Haiku 4.5 | 14.4 | 8.4 | 11.3 | 22.6 | 25.7 | 14.3 | 16.2 | 16.6 | 26.4 | 17.3 |
| Claude Sonnet 4.6 | 9.9 | 9.6 | 12.1 | 26.0 | 14.8 | 14.4 | 17.3 | 19.0 | 50.7 | 19.3 |
| Claude Opus 4.6 | 11.4 | 7.9 | 16.5 | 26.8 | 16.6 | 12.0 | 16.7 | 18.7 | 54.3 | 20.1 |
| GPT 5.4 mini | 6.5 | 10.4 | 4.4 | 11.3 | 20.1 | 24.2 | 17.4 | 24.0 | 43.3 | 18.0 |
| GPT 4.1 | 15.4 | 9.4 | 11.0 | 16.6 | 16.0 | 18.4 | 10.3 | 18.3 | 30.9 | 16.3 |
| GPT 5.4 | 5.9 | 8.4 | 5.2 | 10.4 | 6.0 | 14.9 | 17.4 | 21.1 | 37.2 | 14.0 |
| Gemini 3.1 Flash | 19.9 | 7.1 | 14.3 | 11.9 | 14.7 | 12.4 | 13.6 | 20.5 | 41.1 | 17.3 |
| Gemini 3.1 Pro | 5.2 | 7.4 | 8.9 | 9.7 | 8.0 | 7.9 | 13.5 | 16.7 | 15.3 | 10.3 |
| Socrates-Llama3-8B-SFT | 21.8 | 6.4 | 8.3 | 7.9 | 7.6 | 8.1 | 8.3 | 8.7 | 54.3† | 14.6 |
| Socrates-Llama3-8B-DPO | 23.7 | 7.5 | 5.6 | 23.7 | 21.1 | 26.1 | 9.8 | 28.1 | 46.9 | 21.4 |
| Socrates-Qwen-14B-SFT | 20.8 | 4.8 | 8.8 | 6.7 | 4.0 | 11.1 | 9.0 | 13.1 | 54.3† | 14.7 |
| Socrates-Qwen-14B-DPO | 20.1 | 3.5 | 11.8 | 31.6 | 19.5 | 21.4 | 17.7 | 15.3 | 18.8 | 17.7 |
| Centaur-70B | 21.4 | 2.9 | 7.6 | 11.4 | 14.4 | 17.2 | 5.3 | 19.8 | 24.8 | 13.9 |
| Be.FM-1-8B | 4.3 | 4.9 | 3.8 | 7.5 | 4.8 | 7.9 | 6.3 | 8.1 | 1.2 | 5.4 |
| Be.FM-1-70B | 5.3 | 4.4 | 2.3 | 7.0 | 4.1 | 3.9 | 3.4 | 8.8 | 28.8 | 7.6 |
| Be.FM-1.5-4B | 2.5 | 2.1 | 2.8 | 16.8 | 8.3 | 7.9 | 5.5 | 4.2 | 3.4 | 5.9 |
| Be.FM-1.5-70B | 6.7 | 2.6 | 3.0 | 14.9 | 4.9 | 5.4 | 5.4 | 5.1 | 10.8 | 6.5 |

Table 11: Single\-round game behavior prediction given observations from other games: Wasserstein distance \(↓\)\. †The model produced no parseable Push/Pull response, so its Push/Pull $W$ is set to the worst observed value in this column\.

| Model | Dictator | Proposer | Responder | Investor | Banker | Pub. Goods | Bomb | Beauty Contest | Push/Pull | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 22.0 | 24.4 | 29.9 | 32.7 | 26.2 | 24.7 | 17.7 | 26.1 | 47.3 | 27.9 |
| Llama3.3-70B | 24.4 | 15.0 | 18.3 | 36.7 | 26.9 | 25.7 | 18.2 | 23.4 | 48.8 | 26.4 |
| DeepSeek V3.2 | 24.3 | 13.6 | 16.5 | 40.2 | 24.5 | 30.2 | 21.9 | 23.1 | 52.9 | 27.5 |
| Claude Haiku 4.5 | 21.8 | 14.2 | 15.9 | 35.4 | 29.3 | 25.6 | 18.1 | 22.4 | 51.5 | 26.0 |
| Claude Sonnet 4.6 | 20.7 | 13.1 | 15.7 | 35.9 | 22.8 | 26.5 | 17.9 | 19.8 | 55.2 | 25.3 |
| Claude Opus 4.6 | 19.4 | 12.3 | 19.0 | 35.4 | 24.4 | 27.7 | 18.2 | 20.2 | 56.1 | 25.9 |
| GPT 5.4 mini | 23.3 | 15.5 | 13.8 | 38.5 | 29.0 | 37.2 | 17.9 | 27.3 | 44.1 | 27.4 |
| GPT 4.1 | 22.6 | 15.4 | 15.4 | 39.3 | 23.9 | 29.2 | 26.2 | 20.2 | 51.5 | 27.1 |
| GPT 5.4 | 22.5 | 14.1 | 14.7 | 37.2 | 25.5 | 32.5 | 17.9 | 25.9 | 44.7 | 26.1 |
| Gemini 3.1 Flash | 24.8 | 14.0 | 20.0 | 45.5 | 23.9 | 37.2 | 21.7 | 24.0 | 53.6 | 29.4 |
| Gemini 3.1 Pro | 23.8 | 14.1 | 17.9 | 39.1 | 25.3 | 32.2 | 19.3 | 21.6 | 47.1 | 26.7 |
| Socrates-Llama3-8B-SFT | 28.7 | 23.4 | 25.5 | 39.6 | 24.8 | 39.6 | 30.8 | 26.8 | 100.0† | 37.7 |
| Socrates-Llama3-8B-DPO | 25.0 | 14.6 | 16.5 | 34.0 | 21.7 | 28.2 | 21.7 | 30.7 | 56.8 | 27.7 |
| Socrates-Qwen-14B-SFT | 29.7 | 22.8 | 24.3 | 42.2 | 26.4 | 30.1 | 27.8 | 26.6 | 100.0† | 36.7 |
| Socrates-Qwen-14B-DPO | 23.7 | 16.0 | 16.5 | 34.4 | 22.9 | 24.3 | 17.7 | 19.7 | 48.1 | 24.8 |
| Centaur-70B | 26.0 | 16.3 | 15.8 | 37.5 | 30.3 | 34.8 | 29.5 | 27.9 | 45.6 | 29.3 |
| Be.FM-1-8B | 24.3 | 19.5 | 20.2 | 44.3 | 28.3 | 38.2 | 26.3 | 27.8 | 89.6 | 35.4 |
| Be.FM-1-70B | 24.8 | 16.9 | 21.5 | 44.6 | 28.1 | 35.0 | 26.2 | 26.7 | 90.1 | 34.9 |
| Be.FM-1.5-4B | 22.9 | 13.3 | 14.0 | 44.1 | 28.0 | 30.6 | 23.3 | 21.4 | 50.2 | 27.5 |
| Be.FM-1.5-70B | 23.8 | 13.5 | 13.5 | 45.8 | 27.0 | 32.8 | 23.0 | 20.8 | 50.0 | 27.8 |

Table 12: Single\-round game behavior prediction given observations from other games: MAE \(↓\)\. †The model produced no parseable Push/Pull response; all such cases are counted as incorrect predictions\.
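The † rule used in the Wasserstein tables (a missing Push/Pull score is set to the worst observed value in the column) can be sketched as below. The helper names are hypothetical, the column excerpt is taken from Table 8, and treating the Avg column as an unweighted mean of the per-game scores is an assumption that happens to reproduce the Be.FM-1.5-70B row of Table 9.

```python
def impute_worst(column):
    """Replace None (no parseable response) with the largest observed value."""
    worst = max(v for v in column.values() if v is not None)
    return {model: (worst if v is None else v) for model, v in column.items()}


def row_average(values):
    """Unweighted mean over games, rounded to one decimal (assumed Avg definition)."""
    return round(sum(values) / len(values), 1)


# Excerpt of the Table 8 Push/Pull column; None marks the dagger entry.
push_pull = {
    "Claude Sonnet 4.6": 57.0,
    "Centaur-70B": 26.0,
    "Socrates-Llama3-8B-SFT": None,
}
filled = impute_worst(push_pull)

# Be.FM-1.5-70B per-game W scores from Table 9.
avg = row_average([4.2, 3.9, 4.8, 2.3, 1.8, 2.9, 5.1, 3.1])
```

Because the published per-game values are themselves rounded, recomputed averages can differ from the printed Avg by 0.1 in some rows.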
## Appendix F Distributions of Model Outputs
This section visualizes the model\-predicted output distribution alongside the test\-set distribution\. The plots cover single\-round game behavior simulation \(Figures [3](https://arxiv.org/html/2606.24162#A6.F3) and [4](https://arxiv.org/html/2606.24162#A6.F4)\), multi\-round game behavior prediction \(Figures [5](https://arxiv.org/html/2606.24162#A6.F5) and [6](https://arxiv.org/html/2606.24162#A6.F6)\), and across\-game single\-round prediction \(Figures [7](https://arxiv.org/html/2606.24162#A6.F7) and [8](https://arxiv.org/html/2606.24162#A6.F8)\)\. The top row of each figure shows the test\-set distribution, and the remaining rows show the per\-model predicted distributions\.
### F\.1 Single\-round Game Behavior Simulation
Figure 3: Distribution of model outputs in single\-round game behavior simulation \(Part 1\)\.

Figure 4: Distribution of model outputs in single\-round game behavior simulation \(Part 2\)\.
### F\.2 Multi\-round Game Behavior Prediction
Figure 5: Distribution of model outputs in multi\-round game behavior prediction \(Part 1\)\.

Figure 6: Distribution of model outputs in multi\-round game behavior prediction \(Part 2\)\.
### F\.3 Single\-round Game Behavior Prediction Given Observations from Other Games
Figure 7: Distribution of model outputs in single\-round game behavior prediction given observations from other games \(Part 1\)\.

Figure 8: Distribution of model outputs in single\-round game behavior prediction given observations from other games \(Part 2\)\.
## Appendix G Prompt Templates
This appendix documents all prompt templates used in the experiments\. For each task, we show the system prompt and user prompt template\. Placeholders are shown as \{variable\}\. Section headers within prompts \(originally in Markdown\) are rendered in bold\.
### A\. Economic Games — Single\-round Behavior Simulation
All single\-round economic game prompts share the following system prompt\.
##### System prompt\.
You are a player who is playing an economics game\.
##### Dictator Game\.
You are paired with another player\. Your role is to decide how to divide $100 and the other player simply receives your choice\. How would you like to divide the money? Please give only one concrete choice and highlight the amount you give to the other player in \[\] \(such as \[$x\]\)\.
##### Ultimatum Game — Proposer\.
This is a two\-player game\. You are the Proposer, and the other player is the Responder\. As the proposer, you propose how to divide $100 and the Responder chooses either Accept or Reject\. If accepted, the two of you will earn as described by the accepted proposal accordingly\. If rejected, then both of you will earn $0\. How much would you like to propose to give to the Responder? Please give only one concrete choice, and highlight the amount with \[\] \(such as \[$x\]\)\.
##### Ultimatum Game — Responder\.
This is a two\-player game\. You are the Responder, and the other player is the Proposer\. The proposer proposes how to divide $100 and you, as the Responder, choose either Accept or Reject\. If accepted, the two of you will earn as described by the accepted proposal accordingly\. If rejected, then both of you will earn $0\. As the Responder, what is the minimal amount in the proposal that you would accept? Please give only one concrete choice, and highlight the amount with \[\] \(such as \[$x\]\)\.
##### Trust Game — Investor\.
This is a two\-player game\. You are an Investor and the other player is a Banker\. You have $100 to invest and you choose how much of your money to invest with the Banker\. The amount you choose to invest will grow by 3x with the Banker\. For example, if you invest $10, it will grow to $30 with the Banker\. The Banker then decides how much of the money \($0\-$30\) to return to you, the Investor\. How much of the $100 would you like to invest with the Banker? Please give only one concrete choice, and highlight the number with \[\] \(such as \[$x\]\)\.
##### Trust Game — Banker\.
This is a two\-player game\. You are a Banker and the other player is an Investor, and the goal for each player is to earn more\. The Investor chooses how much of the money \(up to $100\) to invest with you\. The amount the Investor invests will generate a 2x return with you \(the current value of investment becomes 3x\)\. To settle the investment, as the Banker, you get to decide how much of this total amount to return to the Investor and the rest will be kept as your profit\. For example, you can choose to return $0 \(therefore the investor will lose their investment\), or you can return the entire 3x \(initial investment \+ 2x profit\) to the investor, or you can return any amount in between\. Now, if the investor has invested $50 with you and the current value became $150, how much of the $150 would you like to return to the Investor? Please give only one concrete choice, and highlight the number with \[\] \(such as \[$x\]\)\.
##### Public Goods Game\.
In this public good game, you and 3 others will choose how much to contribute towards a water cleaning project\. Each of you gets $20 per round to contribute between $0 and $20\. The project has a 50% return rate\. Your payoff relies on both your and others’ contributions\. Everyone benefits from the group’s total contribution\. Your payoff in each round equals the amount you didn’t contribute \(endowment \- contribution\) plus everyone’s benefit \(total contributions \* 50% return rate\)\. Here are two examples to calculate your payoff\.

Example one: You contributed $12; total group contributions were $20

Your Payoff = \($20\-$12\) \+ $20\*50% = $18

Example two: You contributed $12; total group contributions were $30

Your Payoff = \($20\-$12\) \+ $30\*50% = $23

We will play a total of 3 rounds, in the first round, how much of the $20 would you like to contribute? Please give a concrete number and highlight it with \[\] \(e\.g\., \[x\]\)\.
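The payoff rule stated in this prompt can be checked with a short sketch. The function and constant names below are illustrative and do not come from the benchmark code; the endowment and return rate are taken from the prompt text.

```python
from fractions import Fraction

ENDOWMENT = 20                 # dollars per player per round, from the prompt
RETURN_RATE = Fraction(1, 2)   # the prompt's 50% return rate


def public_goods_payoff(own_contribution: int, total_contributions: int) -> Fraction:
    """Payoff = (endowment - own contribution) + total contributions * return rate."""
    if not 0 <= own_contribution <= ENDOWMENT:
        raise ValueError("contribution must be between $0 and $20")
    if total_contributions < own_contribution:
        raise ValueError("group total must include the player's own contribution")
    return (ENDOWMENT - own_contribution) + total_contributions * RETURN_RATE


# The two worked examples from the prompt.
print(public_goods_payoff(12, 20))  # 18
print(public_goods_payoff(12, 30))  # 23
```

Exact fractions are used instead of floats so that half-dollar amounts compare cleanly.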
##### Bomb Risk Game\.
There are 100 boxes, and one bomb has been randomly placed in 1 of 100 boxes\. You can choose to open 0\-100 boxes at the same time\. If none of the boxes you open has the bomb, you earn points that are equal to the number of boxes you open\. If one of the boxes you open has the bomb, you earn zero points\. How many boxes would you open? Please give one concrete number and highlight it with \[\] \(such as \[x\]\)\.
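As a reference point for reading responses to this prompt, the expected number of points follows directly from the stated rules: opening k boxes avoids the bomb with probability (100 - k)/100 and then pays k points. The sketch below computes this risk-neutral quantity; it is arithmetic on the prompt, not an analysis reported in the paper, and the function name is illustrative.

```python
from fractions import Fraction

N_BOXES = 100  # number of boxes, from the prompt


def expected_points(boxes_opened: int) -> Fraction:
    """Expected points when opening `boxes_opened` boxes, one of the 100 hiding the bomb."""
    if not 0 <= boxes_opened <= N_BOXES:
        raise ValueError("boxes_opened must be between 0 and 100")
    p_safe = Fraction(N_BOXES - boxes_opened, N_BOXES)  # chance the bomb is not opened
    return boxes_opened * p_safe


# Choice that maximizes expected points, and the expected points it yields.
best = max(range(N_BOXES + 1), key=expected_points)
print(best, expected_points(best))  # 50 25
```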
##### Beauty Contest \(Guessing\) Game\.
This is a number\-guessing game where you will play with \{num\_players\} other players\. Each player chooses an integer between 0 and 100 which is hidden from other players\. The player who chooses a number that is closest to 2/3 of the average wins the game\. For example, if the average of the guesses is 60 and if you guess 40, i\.e\. 2/3 of 60, you win the game\. What number would you like to guess? Please give a concrete number and highlight it with \[\] \(e\.g\., \[x\]\)\.
##### Push/Pull \(Prisoner’s Dilemma\) Game\.
You’re paired with another player, each having a $400 ‘Push’ card and a $300 ‘Pull’ card\. Your payoff depends on both players’ card choices\. Here are the scenarios:

- Both play ‘Push’: Each earns $400
- You play ‘Push’, the other player plays ‘Pull’: You earn $0, the other player earns $700
- You play ‘Pull’, the other player plays ‘Push’: You earn $700, the other player earns $0
- Both play ‘Pull’: Each earns $300

In 5 total rounds, which card would you play in the first round? Please give a concrete choice and highlight it with \[\] \(i\.e\., \[Push\] or \[Pull\]\)\.

Instruction suffix: Important: Answer with \[Push\] or \[Pull\], not a number\.
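The four scenarios in this prompt form a 2x2 payoff table. A minimal sketch of that table as a lookup follows; the dictionary and function names are illustrative, and the dollar amounts are copied from the prompt.

```python
# (your card, other player's card) -> (your payoff, other player's payoff), in dollars.
PAYOFFS = {
    ("Push", "Push"): (400, 400),
    ("Push", "Pull"): (0, 700),
    ("Pull", "Push"): (700, 0),
    ("Pull", "Pull"): (300, 300),
}


def payoff(your_card: str, other_card: str) -> tuple:
    """Return (your payoff, other player's payoff) for one round."""
    return PAYOFFS[(your_card, other_card)]


# 'Pull' pays more than 'Push' whatever the other player does,
# while mutual 'Push' pays each player more than mutual 'Pull'.
pull_dominates = all(
    payoff("Pull", other)[0] > payoff("Push", other)[0] for other in ("Push", "Pull")
)
print(pull_dominates)  # True
```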
### B\. Economic Games — Multi\-Round Behavior Prediction
The system prompt is the same as in Section A\. Multi\-round prompts ask the model to predict a player’s next action given their play history\. The general user prompt template is:
##### General template\.
\{game\_instruction\}

Play History

In this session, you have already played \{num\_rounds\} round\(s\)\. Here is the play history:

\{history\}

Your Choice

Now in Round \{next\_round\}, what would you choose? Please give only one concrete choice and highlight the number with \[\] \(such as \[$x\]\)\.
Here \{game\_instruction\} is the single\-round prompt from Section A and \{history\} consists of per\-round lines in one of the following formats\.
##### History line formats\.
- Without other players \(Dictator, Bomb\): \- Round \{round\_num\}\. Your choice: \[\{your\_action\}\]\.
- With one other player, singular \(Trust\): \- Round \{round\_num\}\. Your choice: \[\{your\_action\}\], the other player’s choice: \[\{others\_action\}\]\.
- With multiple other players, plural \(Public Goods\): \- Round \{round\_num\}\. Your choice: \[\{your\_action\}\], the other players’ choices: \[\{others\_action\}\]\.
- Beauty Contest \(custom\): \- Round \{round\_num\}\. Your choice: \[\{your\_action\}\], the group average is \[\{group\_avg\}\], and 2/3 of that is \[\{target\}\]\. \{round\_outcome\}
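The history line formats can be rendered with plain string formatting, as in the sketch below. The function names are hypothetical, all values are passed in as pre-formatted strings, and how multiple players' choices are joined inside `{others_action}` is not specified in this appendix, so the caller supplies that string.

```python
from typing import Optional


def history_line(round_num: int, your_action: str,
                 others_action: Optional[str] = None, plural: bool = False) -> str:
    """Render one history line for games with zero, one, or several other players."""
    line = f"- Round {round_num}. Your choice: [{your_action}]"
    if others_action is None:  # Dictator, Bomb
        return line + "."
    label = "the other players’ choices" if plural else "the other player’s choice"
    return f"{line}, {label}: [{others_action}]."


def beauty_contest_line(round_num: int, your_action: str, group_avg: str,
                        target: str, round_outcome: str) -> str:
    """Render the custom Beauty Contest history line."""
    return (f"- Round {round_num}. Your choice: [{your_action}], "
            f"the group average is [{group_avg}], and 2/3 of that is [{target}]. "
            f"{round_outcome}")


# Illustrative values only.
print(history_line(1, "$40"))
print(history_line(2, "$10", "$30"))
print(history_line(3, "5", "10, 0, 20", plural=True))
```

The `{history}` placeholder in the general template would then be filled with these lines joined by newlines, which is an assumption about the layout rather than something the appendix states.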
##### Trust Banker — variable investment\.
In multi\-round Trust Banker prompts, the investment amount varies per round\. The game rules \(same as Section A\) are followed by the play history and a round\-specific question:

Now in this round, the investor has invested $\{investment\} with you and the current value became $\{total\}\. How much of the $\{total\} would you like to return to the Investor? Please give only one concrete choice and highlight the number with \[\] \(such as \[$x\]\)\.
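Because the Banker prompt in Section A states that the investment's current value becomes 3x, `{total}` can be derived from `{investment}`. The sketch below fills the round-specific question under that assumption; the function name is hypothetical.

```python
MULTIPLIER = 3  # the invested amount's current value becomes 3x (Section A rules)


def banker_question(investment: int) -> str:
    """Fill the round-specific Trust Banker question for a given investment."""
    if not 0 <= investment <= 100:
        raise ValueError("the Investor can invest between $0 and $100")
    total = MULTIPLIER * investment
    return (
        f"Now in this round, the investor has invested ${investment} with you "
        f"and the current value became ${total}. How much of the ${total} would "
        "you like to return to the Investor? Please give only one concrete choice "
        "and highlight the number with [] (such as [$x])."
    )


print(banker_question(50))
```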
##### Beauty Contest — multi\-round\.
The guessing game uses simplified rules \(without the round count\) and a modified question:

Now you will play the game again \(Round \{next\_round\}\)\. What number would you like to guess? Please give a concrete number and highlight it with \[\] \(e\.g\., \[x\]\)\.
##### Push/Pull — multi\-round\.
The Push/Pull game uses \[Push\]/\[Pull\] choices instead of numeric values:

Which card would you play in Round \{next\_round\}? Please give a concrete choice and highlight it with \[\] \(i\.e\., \[Push\] or \[Pull\]\)\.

Respond with only \[Push\] or \[Pull\]\. Do not explain\.
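The prompts in Sections A and B ask for answers highlighted in square brackets, either a number such as [$x] or a card such as [Push]. This appendix does not describe how responses are parsed, so the following is only one plausible sketch: the regular expressions, the function names, and the choice to keep the last bracketed match are all assumptions.

```python
import re
from typing import Optional

# A bracketed number with an optional dollar sign, e.g. [$50], [50], [12.5].
_NUMERIC = re.compile(r"\[\s*\$?\s*(\d+(?:\.\d+)?)\s*\]")
# A bracketed card choice, matched case-insensitively.
_CHOICE = re.compile(r"\[\s*(Push|Pull)\s*\]", re.IGNORECASE)


def parse_numeric(response: str) -> Optional[float]:
    """Return the last bracketed number in the response, or None if there is none."""
    matches = _NUMERIC.findall(response)
    return float(matches[-1]) if matches else None


def parse_choice(response: str) -> Optional[str]:
    """Return 'Push' or 'Pull' from the last bracketed choice, or None."""
    matches = _CHOICE.findall(response)
    return matches[-1].capitalize() if matches else None


print(parse_numeric("I would invest [$50]."))  # 50.0
print(parse_choice("My answer is [pull]."))    # Pull
print(parse_numeric("I would invest [x]."))    # None
```

Returning `None` for unparseable responses keeps the decision about retries or exclusions with the caller.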
### C\. Economic Games — Across\-context Single\-round Prediction
The system prompt is the same as in Section A\. Given a player’s history from other games, the model predicts their choice in a new target game\.
##### User prompt template\.
Play History

You have played the following economics games before\.

\{history\_section\}

Your Choice

Now you are playing a new game with the game instruction below:

\{target\_instruction\}
Each entry in \{history\_section\} is formatted as:
\{index\}\. \{game\_name\} \- Game instruction: \{instruction\} \- Your choice: \[\{choice\}\]
The \{target\_instruction\} is the single\-round prompt from Section A for the target game\.
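A minimal sketch of how the across-context prompt could be assembled from the template and entry format above. The function name, the 1-based entry index, and the line breaks between parts are assumptions, since the extracted template does not preserve its original layout.

```python
from typing import List, Tuple


def build_cross_game_prompt(history: List[Tuple[str, str, str]],
                            target_instruction: str) -> str:
    """history holds (game_name, instruction, choice) tuples from other games."""
    entries = [
        f"{index}. {game_name} - Game instruction: {instruction} - Your choice: [{choice}]"
        for index, (game_name, instruction, choice) in enumerate(history, start=1)
    ]
    parts = [
        "Play History",
        "You have played the following economics games before.",
        *entries,
        "Your Choice",
        "Now you are playing a new game with the game instruction below:",
        target_instruction,
    ]
    return "\n".join(parts)


# Illustrative history with placeholder instruction text.
example = build_cross_game_prompt(
    [("Bomb Risk Game", "<single-round prompt from Section A>", "40")],
    "<single-round prompt for the target game>",
)
print(example)
```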
### D\. Big Five Personality Survey
##### Task 1: Personality Score from Demographics\.
System: You are an expert in psychology\. Given a person’s demographics, your task is to predict this person’s BigFive dimensionality scores\.

User:

Demographics

\{demographics\}

BigFive Dimensionality Scores

Each dimensionality score ranges from 10 to 50, with 10 indicating the lowest score in that dimension and 50 indicating the highest score\.

Output Format

Based on this person’s demographics, please estimate this person’s personality score in the \*\{dimension\_name\}\* dimension\. Please output a single number in the range from 10 to 50, highlighted in \[\] \(e\.g\., \[x\]\)\.
\{demographics\} is formatted as: “A \{age\}\-year\-old \{gender\} from \{country\}\. \{hand\}\-handed\. The race is \{race\}\. The native language is \{engnat\}\.” \{dimension\_name\} is one of: Extraversion, Neuroticism, Agreeableness, Conscientiousness, Openness\.
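The `{demographics}` string can be produced with a single format call, as sketched below. The function name is hypothetical and the example values are illustrative only; the actual value vocabulary for fields such as `hand` and `engnat` is not given in this appendix.

```python
def format_demographics(age: int, gender: str, country: str,
                        hand: str, race: str, engnat: str) -> str:
    """Fill the demographics sentence used in Task 1."""
    return (
        f"A {age}-year-old {gender} from {country}. {hand}-handed. "
        f"The race is {race}. The native language is {engnat}."
    )


# Illustrative values, not taken from the dataset.
print(format_demographics(30, "female", "US", "Right", "Asian", "English"))
```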
##### Task 2: Survey Response Prediction Given Demographics\.
System: You are a \{age\}\-year\-old \{gender\} from \{country\}\. You are \{hand\}\-handed\. Your race is \{race\}\. Your native language is \{engnat\}\.

User: The following item was rated on a five\-point scale where 1=Disagree, 2=Partially Disagree, 3=Neutral, 4=Partially Agree, 5=Agree\. Please select how this statement describes you and highlight your answer in \[\]\(such as \[1\],\[2\],\[3\],\[4\],or \[5\]\): \{question\} Only output your answer in brackets\.
##### Task 3: Age Prediction from Personality\.
System: You are an expert in psychology\. Given a person’s BigFive personality test dimensionality scores, your task is to predict this person’s age\.

User:

BigFive Dimensionality Scores

Each dimensionality score ranges from 10 to 50, with 10 indicating the lowest score in that dimension and 50 indicating the highest score\.

- Openness: \{O\_score\}
- Conscientiousness: \{C\_score\}
- Extraversion: \{E\_score\}
- Agreeableness: \{A\_score\}
- Neuroticism: \{N\_score\}

Output Format

Please predict this person’s age\. Please output a single number indicating the age highlighted with \[\] \(e\.g\., \[x\]\)\. Only output your answer in brackets\.
##### Task 4: Sequential Survey Response Prediction\.
System: You are an expert in psychology\. Given a subject’s previous answers in a Big Five personality test \(in the order they answered them\), your task is to predict this subject’s answer to the next question\.

User:

Subject’s Previous Answers

The following items were rated on a five point scale where 1=Disagree, 2=Slightly Disagree, 3=Neutral, 4=Slightly Agree, 5=Agree\.

\{previous\_answers\}

The Next Question

\{next\_question\}

Output Format

Please predict the subject’s answer to the next question and highlight your prediction in \[\]\(such as \[1\],\[2\],\[3\],\[4\],or \[5\]\)\. Only output your answer in brackets\.
Items are numbered by original survey position \(1–50\); no item codes are shown\.
##### Task 5: Masked Survey Response Prediction\.
System: You are an expert in psychology\. Given a subject’s answers to 9 questions measuring the same Big Five personality dimension, your task is to predict this subject’s answer to the remaining question in that dimension\.

User:

Subject’s Answers \(\{dimension\_name\} Dimension\)

The following items were rated on a five point scale where 1=Disagree, 2=Slightly Disagree, 3=Neutral, 4=Slightly Agree, 5=Agree\.

\{context\_answers\}

The Remaining Question

\{target\_question\}

Output Format

Please predict the subject’s answer to the remaining question and highlight your prediction in \[\]\(such as \[1\],\[2\],\[3\],\[4\],or \[5\]\)\. Only output your answer in brackets\.
\{context\_answers\} lists 9 items numbered 1–9 \(no item codes\)\.
##### Task 6: Across\-Dimension Personality Prediction\.
System: You are an expert in psychology\. Given a person’s BigFive personality test scores in four dimensions, your task is to predict the score in the remaining dimension\.

User:

BigFive Dimensionality Scores

Each dimensionality score ranges from 10 to 50, with 10 indicating the lowest score in that dimension and 50 indicating the highest score\.

\{dimension\_scores\}

Output Format

Based on the four dimension scores above, please predict this person’s score in the \*\{target\_name\}\* dimension\. Please output a single number in the range from 10 to 50, highlighted in \[\] \(e\.g\., \[x\]\)\. Only output your answer in brackets\.
\{dimension\_scores\} lists 4 of the 5 dimensions \(Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism\) with their scores\.
### E\. Scientific Workflow Prediction
##### System prompt\.
You are an expert in research tasked with generating detailed prompts for various aspects of academic research papers\. Each task involves creating a specific type of prompt based on the provided information\. Here are the definitions of each part you will work with:

- Context: The status quo of related literature or reality which motivated this study\. This could normally be a problem, a research question, or a research gap that has not been successfully addressed by previous work\. This is anything that happened before this study\.
- Key Idea: The main intellectual merit of this paper, often in comparison to the context\. This could normally be a novel idea or solution proposed in this paper that distinguishes it from what’s already done in literature\. This is proposed in this study\.
- Method: The specific research method that investigates and validates the key idea\. This could be an experimental setup, a theoretical framework, or other necessary methodology to implement and/or evaluate the key idea\. This is performed in this study\.
- Outcome: The factual statement about the study output\. This could be the experiment results and any other measurable outcome that has occurred\. It marks whether the key hypothesis is testified or not\. This is produced in this study\.
- Projected Impact: The author\-anticipated impact of the work on the field, and potential further research identified by the author that may improve or extend this study\. This is anything being anticipated but has not happened yet\.
##### Task user prompt templates\.
The five workflow prediction tasks use the following user prompt templates:
1. Idea Generation: Given the context: ‘\{context\}’, generate one key idea that could advance this area of study\. Output only the key idea without explanation\.
2. Method Recommendation: Given the context: ‘\{context\}’ and the key idea: ‘\{key\_idea\}’, recommend one method to validate this idea\. Output only the method without explanation\.
3. Outcome Prediction: Based on the context: ‘\{context\}’, the key idea: ‘\{key\_idea\}’, and the method: ‘\{method\}’, predict the outcome of this research\. Output only the outcome without explanation\.
4. Impact Prediction: Based on the context: ‘\{context\}’, the key idea: ‘\{key\_idea\}’, the method: ‘\{method\}’, and the outcome: ‘\{outcome\}’, suggest the projected impact for this research\. Output only the impact without explanation\.
5. Title Prediction: Given the context: ‘\{context\}’, the key idea: ‘\{key\_idea\}’, the method: ‘\{method\}’, the outcome: ‘\{outcome\}’, and the projected impact: ‘\{projected\_impact\}’, predict the title of this research paper\. Output only the title without explanation\.
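Each workflow task conditions on all earlier workflow components, so the five templates can be stored once and filled with whichever fields a task needs. The sketch below shows this with Python's `str.format`; the dictionary keys and function name are hypothetical, and the template strings are copied from the list above.

```python
TEMPLATES = {
    "idea": (
        "Given the context: ‘{context}’, generate one key idea that could advance "
        "this area of study. Output only the key idea without explanation."
    ),
    "method": (
        "Given the context: ‘{context}’ and the key idea: ‘{key_idea}’, recommend "
        "one method to validate this idea. Output only the method without explanation."
    ),
    "outcome": (
        "Based on the context: ‘{context}’, the key idea: ‘{key_idea}’, and the "
        "method: ‘{method}’, predict the outcome of this research. Output only the "
        "outcome without explanation."
    ),
    "impact": (
        "Based on the context: ‘{context}’, the key idea: ‘{key_idea}’, the method: "
        "‘{method}’, and the outcome: ‘{outcome}’, suggest the projected impact for "
        "this research. Output only the impact without explanation."
    ),
    "title": (
        "Given the context: ‘{context}’, the key idea: ‘{key_idea}’, the method: "
        "‘{method}’, the outcome: ‘{outcome}’, and the projected impact: "
        "‘{projected_impact}’, predict the title of this research paper. Output only "
        "the title without explanation."
    ),
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill one template; a missing field raises KeyError, extra fields are ignored."""
    return TEMPLATES[task].format(**fields)


# Placeholder field values for illustration.
print(build_prompt("method", context="<context>", key_idea="<key idea>"))
```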
### F\. IEO Economics Contest
System: You are an expert in economics\. Answer the following multiple\-choice question by selecting the correct option \(A, B, C, or D\)\. Provide only the letter of your answer\.

User: \{question\}

A\. \{option\_A\} B\. \{option\_B\} C\. \{option\_C\} D\. \{option\_D\}
### G\. Contextual Factor Inference — Dictator Game
##### System prompt\.
You are an expert in behavioral economics\.
##### User prompt template\.
Given the observation of an experiment on the Dictator game, your task is to infer what experiment designs could lead to the observed treatment effect\. In the Dictator game, given an endowment of money, one player \(the dictator\) chooses how much of the money to keep and how much to give to a second player\.

In an economic experiment of the Dictator game, we observed the subject behaviors – the proportion of money to share – \{direction\} compared to the standard game design\. Please list possible experiment designs that could lead to this treatment effect\. Please rank output experiment designs by confidence\.

\{direction\} is one of: increased, decreased\.
## Appendix H Licenses for Existing Assets
BehaviorBench is constructed from the publicly available data sources described in Section [A\.1](https://arxiv.org/html/2606.24162#A1.SS1)\. We summarize the licensing and terms\-of\-use status of each below\.
##### MobLab economic\-games data\.
##### Big Five Personality Test\.
##### Scientific articles \(AER, NHB\)\.
We use only the titles and abstracts of articles published in the *American Economic Review* \(American Economic Association\) and *Nature Human Behaviour* \(Springer Nature\)\. All articles are properly cited, and titles and abstracts are used for non\-commercial research purposes consistent with standard fair\-use practice for bibliographic data\.
##### International Economics Olympiad\.
Past IEO multiple\-choice problems are publicly distributed for educational use \([https://ieo\-official\.org/](https://ieo-official.org/)\)\. We use them only for non\-commercial research evaluation\.