AI Coding Agents Can Reproduce Social Science Findings

arXiv cs.CL Papers

Summary

This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.

arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient, either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

Cached at: 06/11/26, 01:37 PM

# AI Coding Agents Can Reproduce Social Science Findings
Source: [https://arxiv.org/html/2606.11447](https://arxiv.org/html/2606.11447)
Meysam Alizadeh \(University of Oxford\), Mohsen Mosleh \(University of Oxford\), Fabrizio Gilardi \(University of Zurich\)

###### Abstract

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited\. Existing evaluation benchmarks are insufficient, either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly\. Here we introduce SocSci\-Repro\-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non\-reproducible due to missing data, allowing us to isolate agents’ reproduction capacity\. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex\. These reproduction rates considerably exceed those previously reported for general\-purpose LLM\-based agents on comparable reproducibility benchmarks\. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization\. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible\. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing\. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production\.

*Keywords:* AI in Science · Social Science · Reproducibility

## 1 Introduction

Interest in autonomous artificial intelligence \(AI\) systems capable of assisting in scientific discovery has grown rapidly in recent years \[[1](https://arxiv.org/html/2606.11447#bib.bib1),[2](https://arxiv.org/html/2606.11447#bib.bib2),[3](https://arxiv.org/html/2606.11447#bib.bib3),[4](https://arxiv.org/html/2606.11447#bib.bib4)\], with proposed applications spanning literature synthesis, hypothesis generation, and data analysis \[[5](https://arxiv.org/html/2606.11447#bib.bib5),[6](https://arxiv.org/html/2606.11447#bib.bib6),[7](https://arxiv.org/html/2606.11447#bib.bib7),[8](https://arxiv.org/html/2606.11447#bib.bib8),[4](https://arxiv.org/html/2606.11447#bib.bib4),[9](https://arxiv.org/html/2606.11447#bib.bib9)\]\. Before such systems can meaningfully participate in scientific knowledge production, however, they must first demonstrate the ability to reproduce existing computational results from original data and code \[[10](https://arxiv.org/html/2606.11447#bib.bib10)\]\. Existing studies evaluated general\-purpose large language model \(LLM\) agents, such as AutoGPT, on reproducibility benchmarks, providing initial evidence that these agents struggle to reliably execute end\-to\-end scientific workflows \[[10](https://arxiv.org/html/2606.11447#bib.bib10),[11](https://arxiv.org/html/2606.11447#bib.bib11),[12](https://arxiv.org/html/2606.11447#bib.bib12)\]\. However, the recent introduction of specialized AI coding agents designed to autonomously execute code, manage dependencies, and debug workflows represents a major technological shift\. Their performance remains largely untested, particularly in social science, where large\-scale reproducibility evaluations remain limited\.

Computational reproducibility\[[13](https://arxiv.org/html/2606.11447#bib.bib13),[14](https://arxiv.org/html/2606.11447#bib.bib14),[15](https://arxiv.org/html/2606.11447#bib.bib15),[16](https://arxiv.org/html/2606.11447#bib.bib16),[17](https://arxiv.org/html/2606.11447#bib.bib17),[18](https://arxiv.org/html/2606.11447#bib.bib18),[19](https://arxiv.org/html/2606.11447#bib.bib19)\], defined as the ability to reproduce a study’s findings from the original author\-provided data and code\[[10](https://arxiv.org/html/2606.11447#bib.bib10)\], serves as a minimal but necessary benchmark for evaluating whether AI systems can function as dependable participants in scientific knowledge production\. Achieving reproducibility is often challenging even when code and data are available, as failures may arise from undocumented dependencies, version mismatches, operating system differences, or stochastic elements in analytical pipelines\[[20](https://arxiv.org/html/2606.11447#bib.bib20),[21](https://arxiv.org/html/2606.11447#bib.bib21),[22](https://arxiv.org/html/2606.11447#bib.bib22),[23](https://arxiv.org/html/2606.11447#bib.bib23),[10](https://arxiv.org/html/2606.11447#bib.bib10)\]\.

Systematic evaluation of LLMs on computational reproducibility in social science remains limited \[[24](https://arxiv.org/html/2606.11447#bib.bib24),[16](https://arxiv.org/html/2606.11447#bib.bib16),[18](https://arxiv.org/html/2606.11447#bib.bib18),[25](https://arxiv.org/html/2606.11447#bib.bib25),[26](https://arxiv.org/html/2606.11447#bib.bib26),[27](https://arxiv.org/html/2606.11447#bib.bib27),[28](https://arxiv.org/html/2606.11447#bib.bib28),[29](https://arxiv.org/html/2606.11447#bib.bib29)\]\. CORE\-Bench \[[10](https://arxiv.org/html/2606.11447#bib.bib10)\] includes only 28 social science tasks, all drawn from a highly standardized repository \(i\.e\., CodeOcean \[[30](https://arxiv.org/html/2606.11447#bib.bib30)\]\)\. Repro\-Bench \[[12](https://arxiv.org/html/2606.11447#bib.bib12)\], although covering 112 papers, relies on studies from nine economics journals and only three political science journals \[[24](https://arxiv.org/html/2606.11447#bib.bib24)\], leaving out sociology, psychology, and communication\. In addition, Repro\-Bench provides access to original paper PDFs, which may encourage models to rely on textual cues rather than independent analysis, increasing the risk of confirmatory specification search where agents navigate analytical choices to match reported results rather than independently reproducing them\. Its tasks also focus on reproducing all major findings of each paper, blurring the distinction between the technical reproducibility of research artifacts and the ability of AI systems to execute reproduction workflows\. Moreover, the performance of recent AI coding agents on social science tasks has not been examined\.

In this paper, we address these challenges by introducing SocSci\-Repro\-Bench, a new benchmark consisting of 54 papers and 221 tasks across four disciplines \(political science, sociology, psychology, and communication\), spanning 13 substantive domains, five online repositories, and three programming languages \(see Methods\)\. Beyond its breadth, SocSci\-Repro\-Bench differs from existing benchmarks in three key ways\. First, to our knowledge, it is the first benchmark built from systematically selected social science papers rather than from pre\-existing datasets originally assembled for other purposes, as is the case for benchmarks such as CORE\-Bench and Repro\-Bench\. Second, although the underlying materials involve randomness, simulations, and stochastic models, it contains only tasks that produced identical results across three manual code executions, allowing us to isolate agents’ ability to reproduce results from issues in the original code itself\. The benchmark also includes a small set of tasks with restricted data access to test whether models can correctly identify reproducibility constraints\. Third, by annotating the research questions underlying each study, it enables evaluation of higher\-level reasoning tasks such as inferring research questions from code and data\.

Using this benchmark, we evaluate the reproducibility performance of two frontier AI coding agents, Claude Code and Codex\. We examine their ability to reproduce published results, infer research questions from replication materials, and respond to contextual information provided through the original paper PDFs\. We further test the susceptibility of coding agents to sycophancy nudging, a form of prompt framing that encourages confirmatory specification search by prioritizing alignment with reported results in the original papers over faithful execution of the supplied code\.

Together, this study provides a systematic evaluation of whether modern AI coding agents can reproduce empirical findings in social science and identifies conditions under which automated reproducibility may fail\. As AI systems become increasingly integrated into scientific workflows, understanding their capabilities and limitations in reproducing existing research will be essential for ensuring the reliability of AI\-assisted science\.

## 2 Claude Code and Codex Performance on SocSci\-Repro\-Bench

Before presenting the results, we briefly summarize the experimental setup \(see Methods for more details\)\. Both agents were evaluated on the same benchmark tasks and replication materials within sandboxed environments that blocked external directory access and web search and limited execution to the provided code and data\. Agents were, however, allowed to install packages\. All reported results are averages across three independent runs of the full evaluation pipeline\. Although the evaluation framework was identical, the agents differ slightly in their prompt design\. Claude Code autonomously inspects and executes existing codebases while resolving environment issues\. Codex did not consistently exhibit this self\-repair capability in our testing environment and therefore required additional prompt guidance to construct an executable replication script when necessary\. Both agents ran in fully automated mode with no human intervention and no memory of prior runs\.

Because benchmark tasks were constructed only from results that were reproducible with the available materials in their current form, the reported accuracies measure AI coding agents’ ability to reproduce social science results conditional on complete and executable replication materials\. The results should therefore not be interpreted as estimates of the overall reproducibility of the underlying social science literature\.

### 2\.1 Reproducibility Results

We compared the computational reproducibility performance of two AI coding agents, Claude Opus 4\.6 \(via Claude Code CLI\) and GPT\-5\.3\-Codex \(via Codex CLI\), across 54 social science papers, each evaluated over three independent runs \(Fig\. [1](https://arxiv.org/html/2606.11447#S2.F1)\)\. Claude Code substantially outperformed Codex at both the task and paper levels\. At the task level, Claude Code achieved a mean accuracy of 93\.4%, compared with 62\.1% for Codex, a difference of 31\.3 percentage points\. This gap widened at the paper level, where a paper was considered fully reproduced only if all of its constituent tasks were answered correctly: Claude Code attained 78\.0% paper\-level accuracy versus 35\.8% for Codex, a difference of 42\.2 percentage points\. Both agents achieved perfect accuracy \(100%\) on non\-reproducible tasks \(N = 10\), correctly identifying all cases where data or code were insufficient for reproduction\. Unlike other tasks in the benchmark, these items require diagnosing the absence of necessary data or code rather than executing statistical analyses\. Accordingly, their interpretation differs from that of standard reproduction tasks\. Performance was consistent across runs for both models, with Claude Code’s task\-level accuracy ranging from 92\.6% to 94\.5% and Codex’s from 58\.4% to 65\.3%, indicating stable and reproducible behavior of the agents themselves\.
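The two aggregation levels described above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the data layout \(a mapping from hypothetical `(paper_id, task_id)` keys to a correctness flag\), the function names, and the toy runs are all assumptions made for illustration\.

```python
from collections import defaultdict
from statistics import mean


def task_accuracy(results):
    """Share of tasks answered correctly.

    `results` maps (paper_id, task_id) -> bool (hypothetical layout).
    """
    return mean(1.0 if ok else 0.0 for ok in results.values())


def paper_accuracy(results):
    """Share of papers for which every constituent task was correct."""
    by_paper = defaultdict(list)
    for (paper_id, _task_id), ok in results.items():
        by_paper[paper_id].append(ok)
    return mean(1.0 if all(oks) else 0.0 for oks in by_paper.values())


def mean_over_runs(metric, runs):
    """Average a metric over independent runs of the pipeline."""
    return mean(metric(run) for run in runs)


# Toy data: two papers with two tasks each, three runs (illustrative only).
runs = [
    {("p1", "t1"): True, ("p1", "t2"): True, ("p2", "t1"): True, ("p2", "t2"): False},
    {("p1", "t1"): True, ("p1", "t2"): True, ("p2", "t1"): True, ("p2", "t2"): True},
    {("p1", "t1"): True, ("p1", "t2"): False, ("p2", "t1"): True, ("p2", "t2"): False},
]

print(f"task-level:  {mean_over_runs(task_accuracy, runs):.3f}")
print(f"paper-level: {mean_over_runs(paper_accuracy, runs):.3f}")
```

Because a single incorrect task marks the whole paper as not reproduced, paper\-level accuracy can never exceed task\-level accuracy within a run when every paper has the same number of tasks, which is one reason the gap between agents can widen at the paper level\.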

Even when excluding tasks where Codex failed entirely \(produced no output\), its task\-level accuracy rises from 62\.1% to only 75\.5%, and paper\-level accuracy from 35\.8% to 49\.2%\. This means that roughly 1 in 4 non\-failed tasks still produced incorrect results, and more than half of non\-failed papers had at least one wrong answer\. For comparison, Claude Code achieves 93\.4% task accuracy and 78\.0% paper accuracy with a 0% failure rate\.
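The relationship between overall accuracy, failure rate, and accuracy among non\-failed items can be checked with a short calculation\. The sketch below assumes that a failed item counts as incorrect in the overall figure and is dropped from the denominator in the conditional one; the function name is hypothetical, and the inputs are the rounded means reported above\.

```python
def accuracy_excluding_failures(overall_accuracy, failure_rate):
    """Accuracy among items that did not fail.

    Assumes failed items count as incorrect overall, so the number of
    correct items is unchanged and only the denominator shrinks.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return overall_accuracy / (1.0 - failure_rate)


# Codex, using the rounded figures reported in the text.
task_level = accuracy_excluding_failures(0.621, 0.178)
paper_level = accuracy_excluding_failures(0.358, 0.270)

print(f"task-level, non-failed:  {task_level:.1%}")
print(f"paper-level, non-failed: {paper_level:.1%}")
```

With these inputs the task\-level value rounds to 75\.5%, matching the reported figure\. The paper\-level value comes out near 49\.0% rather than the reported 49\.2%; a small difference of this kind is expected if the reported number is averaged per run rather than derived from the pooled, rounded means used here\.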

![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/accuracy_fail_comparison.png)

Figure 1: Comparison of Claude Code and Codex across three accuracy metrics and failure rates\. \(Left\) Accuracy for all tasks \(N = 221\), non\-reproducible tasks \(N = 10\), and all papers \(N = 54\)\. Both models achieve perfect accuracy on non\-reproducible tasks, while Claude Code substantially outperforms Codex at both task \(93\.4% vs\. 62\.1%\) and paper level \(78\.0% vs\. 35\.8%\), where a paper was considered fully reproduced only if all of its constituent tasks were answered correctly\. \(Right\) Failure rates across all tasks and papers, defined as cases where the code fails to complete or produce the expected output due to an error or unmet requirement the agent cannot resolve\. Claude Code produces no failures across all runs, whereas Codex exhibits failure rates of 17\.8% at the task level and 27\.0% at the paper level\. Values shown above bars are mean percentages over three runs rounded to one decimal place\.

#### Codex Struggles with Non\-Portable Replication Code:

The replication materials were anonymized but otherwise left unchanged, preserving the original code and directory structures\. These archives frequently contained latent executability issues such as missing dependencies, hardcoded file paths, and incomplete environment specifications that required adaptation before successful execution\. Claude Code autonomously resolved such problems in every case, constructing revised, executable replication pipelines without human intervention; by contrast, Codex failed to produce an answer for 17\.8% of tasks and 27\.0% of papers \(right panel in Fig\. [1](https://arxiv.org/html/2606.11447#S2.F1)\), indicating a limited capacity for self\-repair\. Failure is defined as cases where the code fails to complete or produce the expected output due to an error or unmet requirement the agent cannot resolve\. Common failure modes for Codex included inability to handle missing required R packages and to adapt hardcoded or machine\-specific file paths\. Environment drift, including version incompatibilities, notebook kernel constraints, and deprecated APIs, further compounded these challenges, as did non\-portable interactive dependencies \(see Table [S1](https://arxiv.org/html/2606.11447#A4.T1) in Appendix for all categories of failures and examples\)\. Claude Code achieved a zero failure rate across all three runs, whereas Codex’s failure rate ranged from 14\.1% to 20\.8% of tasks, underscoring a qualitative difference in the agents’ ability to autonomously resolve infrastructural fragilities\. These results suggest that the primary barrier to automated computational reproducibility may lie not in the analytical logic of replication code but in the brittleness of its execution environment, at least for the agents and task set evaluated here, and that sufficiently capable agents can overcome this barrier without manual remediation\.

#### Perfect Python Performance and Broader Gains for Claude Code:

Figure [2](https://arxiv.org/html/2606.11447#S2.F2) presents the average performance of Claude Code and Codex \(across three runs\) stratified by the primary programming language of each replication package \(panels a, b, e\) and by whether the paper was published before or after each agent’s training\-data cutoff \(panels c, d, f\)\. Claude Code consistently outperformed Codex across all strata\. At the task level \(panel a\), Claude Code achieved near\-ceiling average accuracy for Python \(100%\), Stata \(94\.4%\), and R \(91\.9%\), whereas Codex accuracy was substantially lower and more variable, ranging from 40\.0% for Python to 69\.1% for R\. However, because the benchmark contains unequal numbers of tasks across languages \(Python n = 49, R n = 136, Stata n = 36\), these results reflect the composition of the benchmark rather than controlled comparisons of language difficulty\.

This gap widened at the paper level \(panel b\), where a single incorrect task renders the entire paper incorrect: Claude Code fully reproduced 100% of Python papers, 75% of R papers, and 71\.4% of Stata papers, compared with 25%, 41\.7%, and 28\.6% for Codex, respectively\. Codex’s lower accuracy was driven in part by outright execution failures \(panel e\)—tasks for which the agent failed to execute the code\. Codex exhibited the highest failure rate on Stata tasks \(38\.9%\), followed by Python \(25%\) and R \(9\.6%\), suggesting that it struggled most with languages that require translation to an executable environment or that have smaller representation in training corpora\. Claude Code, by contrast, recorded zero task failures across all three languages\.

Stratification by training\-data cutoff \(panels c, d, f\) revealed that neither agent’s performance differed meaningfully between papers published before versus after its knowledge cutoff \(Claude Code: April 2025; Codex: May 2024\)\. Claude Code’s task accuracy was 93\.3% pre\-cutoff and 96\.2% post\-cutoff; Codex showed a similarly flat pattern \(62\.9% versus 62\.5%\)\. Repeating the analysis using preprint appearance dates instead of official publication dates \(not reported\) yields a similar pattern\. The same stability held at the paper level \(panel d\) and for failure rates \(panel f\), where Codex’s failure rate was 18\.2% in both periods\. These results suggest that data contamination\[[31](https://arxiv.org/html/2606.11447#bib.bib31)\]—the possibility that agents succeed simply because they have memorized published results—is unlikely to explain the observed performance differences\. The findings are more consistent with genuine differences in code comprehension, environment setup, and execution capabilities between the two agents, though we cannot fully rule out other explanations\.

\(a\) Task Accuracy by Language ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/strat_task_accuracy_language.png)
\(b\) Paper Accuracy by Language ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/strat_paper_accuracy_language.png)
\(c\) Task Accuracy by Training Cutoff ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/strat_task_accuracy_cutoff.png)
\(d\) Paper Accuracy by Training Cutoff ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/strat_paper_accuracy_cutoff.png)
\(e\) Task Failure by Language ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/strat_task_fail_language.png)
\(f\) Task Failure by Training Cutoff ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/strat_task_fail_cutoff.png)

Figure 2: Stratified performance of Claude Code and Codex across programming languages and training\-data cutoffs\. a, Task\-level accuracy stratified by the primary programming language of each replication package \(Python, n = 49 tasks; R, n = 136; Stata, n = 36\)\. Sample sizes differ across languages and comparisons are descriptive\. b, Paper\-level accuracy \(all tasks correct\) under the same language stratification \(Python, n = 4 papers; R, n = 36; Python and R, n = 7; Stata, n = 7\)\. The other 7 papers were multi\-language packages and not shown in this plot\. c, Task\-level accuracy stratified by whether the paper was published before or after each agent’s training\-data cutoff \(Claude Code: April 2025; Codex: May 2024\)\. d, Paper\-level accuracy under the same cutoff stratification\. e, Task\-level failure rate by language\. f, Task\-level failure rate by training\-data cutoff\. Claude Code \(orange\) achieved 100% accuracy in Python tasks\. Neither agent showed a meaningful difference in performance between pre\- and post\-cutoff papers\. Values shown above bars are mean percentages over three runs rounded to one decimal place\.
#### Paper Access Improves Accuracy:

To assess whether contextual knowledge of a study’s objectives and expected outputs improves automated reproducibility, we repeated our evaluation pipeline with the original paper PDF appended to each anonymized replication package\. Across all 221 tasks, providing PDFs yielded modest accuracy gains for both agents: Claude Code improved from 93\.4% to 94\.5%, while Codex rose from 62\.1% to 65\.4% \(Fig\. [S1](https://arxiv.org/html/2606.11447#A6.F1) in Appendix\)\. Paper\-level accuracy followed a similar trend \(Claude Code: 78\.0% to 80\.4%; Codex: 35\.8% to 41\.4%\)\. The benefits were most pronounced for Codex’s failure rates, which fell from 17\.8% to 12\.2% at the task level and from 27\.0% to 5\.6% at the paper level, consistent with the possibility that the weaker model used information in the PDF to resolve ambiguities in file structure, execution order, or dependency configuration that would otherwise block the pipeline entirely\. Claude Code, by contrast, maintained zero failures in both conditions\. Critically, however, PDF access introduced a systematic bias on non\-reproducible tasks: those for which data restrictions or missing code make execution impossible and the correct answer is an explicit indication of non\-reproducibility\. On these tasks, accuracy dropped from 100\.0% to 63\.3% for Claude Code and from 100\.0% to 90\.0% for Codex, indicating that when models can read the paper’s reported results, they tend to extract the expected numerical output rather than correctly diagnosing an execution failure\. This trade\-off highlights a fundamental tension: while supplementary context helps agents navigate complex replication pipelines, it simultaneously undermines their ability to serve as independent validators of computational reproducibility\.

![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/metadata_comparison_5bar.jpg) \(a\) Claude Code
![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/metadata_comparison_5bar_cx.jpg) \(b\) Codex

Figure 3: Evidence against direct memorization in AI coding agents assessed through metadata recovery from anonymized replication materials\. Stacked bar charts show the percentage of papers \(n = 54\) for which each AI agent correctly recovered the title, authors, journal and publication year from fully anonymized replication code and data, compared against a gold\-standard reference\. \(a\) Claude Code attempted metadata recovery and returned a response for the majority of papers, achieving exact matches for only 18\.5% of titles, 20\.4% of authors, 27\.8% of journals and 48\.1% of years\. Only 11\.1% of all papers were recovered with fully correct metadata across all four fields\. \(b\) Codex showed substantially lower recovery, with 92\.6% of author, journal and year fields returned as unknown \(NA\)\. No paper was recovered with fully correct metadata across all four fields\.
#### Comparison with LLM\-Agents:

Crucially, these results represent a substantial advance over what LLM\-based agents, as opposed to purpose\-built coding agents, were capable of only recently\. When CORE\-Bench was first introduced in 2024, the best\-performing agent, CORE\-Agent powered by GPT\-4o, achieved only 19% accuracy on the hardest tier of its reproducibility benchmark spanning computer science, social science, and medicine \[[10](https://arxiv.org/html/2606.11447#bib.bib10)\]\. Subsequent large\-scale evaluation on CORE\-Bench Hard using the Holistic Agent Leaderboard \(HAL\) \[[11](https://arxiv.org/html/2606.11447#bib.bib11)\], which tested a wider range of frontier models with the same CORE\-Agent scaffold, found that even recent models struggled substantially: DeepSeek V3 achieved 17\.8%, GPT\-4\.1 reached 33\.3%, and Claude\-3\.7 Sonnet reached 35\.6%, with the best\-performing configuration \(CORE\-Agent with Claude Opus 4\.1\) achieving 51\.1%\. Notably, a generalist scaffold consistently underperformed the task\-specific CORE\-Agent across models, highlighting that strong reproducibility performance in prior work depended heavily on domain\-specific scaffolding rather than model capability alone\.

Similarly, PaperBench found that even the top\-performing AI agent scored just 27% on replication tasks drawn from ICML 2024 papers, while human ML experts scored 41% under comparable conditions\[[32](https://arxiv.org/html/2606.11447#bib.bib32)\]\. Subsequent work on REPRO\-Bench, focused specifically on social science papers, reported a best accuracy of 36\.6% after substantial agent\-specific engineering—a result the authors characterized as well below practical thresholds for reliable automation\[[12](https://arxiv.org/html/2606.11447#bib.bib12)\]\. The considerably higher reproduction rates observed in the present study suggest that the shift from general\-purpose LLM agents to specialized coding agents—systems with persistent tool access, iterative execution environments, and native code debugging capabilities—marks a qualitative as well as quantitative improvement in this task\.

### 2\.2Benchmark Performance Is Unlikely to Be Driven by Direct Paper Recall

#### Inferring Paper Metadata with AI Coding Agents:

To assess whether performance may be driven by direct recall of benchmark papers from training data, we evaluate whether agents can recover identifying metadata \(title, authors, journal, year\) from anonymized replication materials \(Fig\.[3](https://arxiv.org/html/2606.11447#S2.F3)\)\. If the models had memorized the specific papers included in the benchmark, they should be able to recognize these identifiers from the code or data structure alone\. Instead, metadata recovery rates were low across all fields\. Claude Code attempted responses for most papers \(unknown rates ranging from 5\.6% for titles to 38\.9% for journals\) but achieved low exact\-match rates across all fields: 18\.5% for titles, 20\.4% for authors, 27\.8% for journals, and 48\.1% for years\. Only 11\.1% of papers were recovered with fully correct metadata across all four fields; the majority of non\-missing responses were mismatches \(75\.9% for titles, 46\.3% for authors\)\. Codex showed substantially weaker recovery: 92\.6% of author, journal, and year fields were returned as unknown, and among the few non\-missing responses, exact matches were near zero \(overall exact match = 0%\)\. These results suggest that agents rarely identify the underlying papers and therefore likely operate primarily through analysis of the provided replication materials rather than direct recall of benchmark studies\. This test does not rule out partial exposure to individual studies during training, but it indicates that the agents rarely recognize the identity of the benchmark papers when given only anonymized code and data\.
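The three outcome categories used in this analysis \(exact match, mismatch, unknown\) can be sketched as a simple scoring routine\. This is an illustrative sketch only: the normalization rule, the set of strings treated as "unknown", the function names, and the toy records are assumptions, not the paper's scoring procedure\.

```python
from collections import Counter

# Strings treated as "no answer" (an assumption for this sketch).
UNKNOWN = {"", "na", "n/a", "unknown"}
FIELDS = ("title", "authors", "journal", "year")


def normalize(value):
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join(str(value).lower().split()) if value is not None else ""


def classify(predicted, gold):
    """Label one predicted field as 'unknown', 'exact', or 'mismatch'."""
    pred = normalize(predicted)
    if pred in UNKNOWN:
        return "unknown"
    return "exact" if pred == normalize(gold) else "mismatch"


def field_rates(predictions, gold, field):
    """Share of papers in each category for a single metadata field."""
    counts = Counter(
        classify(pred.get(field), ref[field]) for pred, ref in zip(predictions, gold)
    )
    return {label: counts[label] / len(gold) for label in ("exact", "mismatch", "unknown")}


def all_fields_exact_rate(predictions, gold, fields=FIELDS):
    """Share of papers with every field recovered exactly."""
    hits = sum(
        all(classify(pred.get(f), ref[f]) == "exact" for f in fields)
        for pred, ref in zip(predictions, gold)
    )
    return hits / len(gold)


# Toy records (illustrative only, not benchmark data).
gold = [
    {"title": "Paper A", "authors": "Doe and Roe", "journal": "Journal X", "year": 2021},
    {"title": "Paper B", "authors": "Poe", "journal": "Journal Y", "year": 2023},
]
predictions = [
    {"title": "paper a", "authors": "Doe and Roe", "journal": "Journal X", "year": "2021"},
    {"title": "Something else", "authors": "NA", "journal": None, "year": 2022},
]

print(field_rates(predictions, gold, "title"))
print(all_fields_exact_rate(predictions, gold))
```

Under this scheme the three categories for a field always sum to 100%, which is consistent with the reported title figures for Claude Code \(18\.5% exact, 75\.9% mismatch, 5\.6% unknown\)\.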

#### Metadata Inference Does Not Explain Claude Code’s Advantage over Codex:

The metadata analysis provides little evidence that Claude Code possesses memorized knowledge of the benchmark papers\. The model correctly identifies all four metadata fields \(title, authors, journal, year\) for only 11\.1% of papers, with particularly high mismatch rates for titles \(75\.9%\) and authors \(46\.3%\)\. Even for the most inferable field, publication year, the exact match rate reaches only 48\.1%\. This pattern is inconsistent with widespread memorization of the original publications: if the model were recalling stored results, one would expect near\-perfect metadata recognition, not an overall exact\-match rate of about 11%\.

A cross\-model comparison further weakens the memorization hypothesis\. Codex reports metadata as “Unknown” for 92\.6% of papers on authors, journal, and year, yet still achieves 62\.1% task\-level accuracy \(75\.5% among non\-failed tasks\)\. This demonstrates that substantial task accuracy is achievable without any apparent knowledge of the source papers, which is consistent with both models operating primarily through computational execution rather than recall\. The 31\.3 percentage\-point gap in task accuracy between Claude Code \(93\.4%\) and Codex \(62\.1%\) is more consistent with differences in agentic capabilities \(reflected in Claude Code’s 0% failure rate versus Codex’s 17\.8%\) than with differential exposure to training data, though we cannot formally decompose the contribution of each factor\.

Combined with the agent’s observed behavior of installing dependencies, debugging scripts, and iteratively executing analyses, and the absence of a performance difference pre\- and post\-cutoff, these results suggest that Claude Code’s high reproducibility primarily reflects agentic capabilities \(reading code, installing dependencies, debugging errors, and executing analyses\) rather than recall of stored results\.

### 2\.3 Evidence of Abstract Reasoning

We design a reasoning\-intensive task to test whether AI coding agents can infer the underlying research questions of empirical studies from anonymized code and data alone\. By removing all descriptive text and contextual cues, the task isolates whether performance reflects pattern memorization or structured reasoning about the conceptual relationships embedded in computational artifacts\. This task requires more than recognizing common statistical routines or familiar modeling templates\. To succeed, an agent must interpret how variables are operationalized, how outcomes are defined, how covariates are incorporated, and how analytical steps are sequenced\. The mapping from code to research question is not one\-to\-one: similar statistical procedures can serve different theoretical aims depending on variable construction and interpretation\. Inferring the research question therefore requires identifying the higher\-level abstractions that structure the analysis\.

![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/rq_compare_overall_cc_vs_cx.jpg) \(a\) Overall Comparison
![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/rq_compare_time_cc_vs_cx.jpg) \(b\) Pre vs Post Cutoff
![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/rq_compare_language_cc_vs_cx.jpg) \(c\) Language Comparison
![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/rq_compare_repo_cc_vs_cx.jpg) \(d\) Repository Comparison

Figure 4: Research question \(RQ\) extraction accuracy of Claude Code and Codex compared to the Gold standard\. Three similarity metrics are shown in each panel: RQ\-level semantic match rate \(proportion of greedy\-paired RQs that are semantically equivalent\), paper\-level full match rate \(proportion of papers where all Gold RQs have a semantic match\), and paper\-level $\geq$60% match rate \(proportion of papers where at least 60% of Gold RQs are matched\)\. \(a\) Overall performance across all papers\. \(b\) Performance stratified by knowledge cutoff \(pre\-cutoff: published by April 2025; post\-cutoff: published May 2025 or later\)\. \(c\) Performance stratified by the primary programming language of the replication code\. \(d\) Performance stratified by data repository\. Sample sizes are indicated in parentheses on the x\-axis\.

This design aligns with arguments in cognitive science that intelligence cannot be reduced to surface pattern matching\. On this view\[[33](https://arxiv.org/html/2606.11447#bib.bib33)\], genuine intelligence involves abstraction, relational understanding, and generalization across representations that differ superficially but share deep structure\. Here, code is treated not merely as syntax but as an expression of theoretical commitments and empirical claims\. Success therefore provides evidence of goal inference and conceptual reconstruction, whereas failure would suggest reliance on shallow heuristics or memorized associations between common analytical pipelines and stereotypical study designs\.

Both Claude Code and Codex demonstrated substantial capacity to recover research questions from SocSci\-Repro\-Bench papers, yet Claude Code consistently outperformed Codex across most evaluation dimensions \(Fig\.[4](https://arxiv.org/html/2606.11447#S2.F4)\)\. At the RQ level, Claude Code achieved a 73\.5% semantic match rate compared with 70\.0% for Codex, and this advantage was more pronounced at the paper level, where Claude Code fully matched all Gold\-standard RQs for 33\.3% of papers versus 24\.1% for Codex, and met the $\geq$60% threshold for 87\.0% versus 77\.8% of papers \(Fig\.[4\(a\)](https://arxiv.org/html/2606.11447#S2.F4.sf1)\)\. Stratification by knowledge cutoff revealed largely stable performance for both agents, with only modest differences between pre\-cutoff and post\-cutoff papers \(Fig\.[4\(b\)](https://arxiv.org/html/2606.11447#S2.F4.sf2)\), suggesting that performance was not primarily driven by training\-set memorization\. Performance varied more markedly by programming language \(Fig\.[4\(c\)](https://arxiv.org/html/2606.11447#S2.F4.sf3)\): both agents performed best on Python\-based papers \(semantic match rates of 82\.4% and 85\.3% for Claude Code and Codex, respectively, with 100% of papers meeting the $\geq$60% threshold\), while Stata\-based papers proved most challenging, particularly for Codex \(64\.0% semantic match rate\)\. Across repositories, Claude Code maintained a consistent advantage over Codex for OSF\-hosted papers, the largest subgroup \($n=30$\), where the gap in paper\-level full match rates was most striking \(30\.0% versus 13\.3%\), whereas Codex achieved higher match rates for the small set of GitHub\-hosted papers \(Fig\.[4\(d\)](https://arxiv.org/html/2606.11447#S2.F4.sf4)\)\.
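
The three match rates defined in the Figure 4 caption can be sketched as follows\. The sketch assumes that each paper comes with a matrix of similarity scores between Gold RQs \(rows\) and agent RQs \(columns\), and that a pair counts as semantically equivalent when its score reaches a threshold\. The scores, the 0\.8 threshold, and the function names are hypothetical; how semantic equivalence was actually judged is not specified in this section\.

```python
def greedy_pairs(sim):
    """Greedily pair gold RQs (rows) with predicted RQs (columns), best score first."""
    cells = sorted(
        ((score, g, p) for g, row in enumerate(sim) for p, score in enumerate(row)),
        reverse=True,
    )
    used_gold, used_pred, pairs = set(), set(), []
    for score, g, p in cells:
        if g not in used_gold and p not in used_pred:
            pairs.append((g, p, score))
            used_gold.add(g)
            used_pred.add(p)
    return pairs


def rq_metrics(papers, threshold=0.8):
    """Return (RQ-level match rate, paper-level full match rate, paper-level >=60% rate)."""
    matched = total_pairs = full = at_least_60 = 0
    for sim in papers:
        pairs = greedy_pairs(sim)
        hits = sum(score >= threshold for _, _, score in pairs)
        matched += hits
        total_pairs += len(pairs)
        share = hits / len(sim)  # share of gold RQs that found a match
        full += share == 1.0
        at_least_60 += share >= 0.6
    n = len(papers)
    return matched / total_pairs, full / n, at_least_60 / n


# Two hypothetical papers with 2 and 3 gold RQs respectively.
papers = [
    [[0.9, 0.2], [0.1, 0.85]],
    [[0.9, 0.3, 0.2], [0.4, 0.5, 0.1], [0.2, 0.1, 0.95]],
]
rq_rate, full_rate, sixty_rate = rq_metrics(papers)
print(rq_rate, full_rate, sixty_rate)
```

In this toy input the second paper matches two of its three Gold RQs, so it counts toward the 60% rate but not toward the full match rate, which illustrates why the paper\-level full match rate is the strictest of the three metrics\.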

### 2\.4 Evidence of Sycophancy under Confirmatory Prompt Nudging

A central promise of automated reproducibility is independence: an agent that faithfully executes provided code and reports what it finds, regardless of what the original paper claims\. But in practice, the framing of a reproduction task is rarely neutral\. A principal investigator might instruct an agent to "make sure our reproduction aligns with the published findings\." A journal’s reproducibility audit might ask an agent to "verify that these results reproduce" rather than "report what this code produces\." A researcher exploring analytical robustness might request that the agent try "alternative defensible approaches" and select the specification closest to the original\. None of these instructions are obviously inappropriate\. Each sounds like a reasonable methodological request\. Yet each subtly shifts the agent’s objective from open\-ended execution to confirmation of a known target\.

![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/accuracy_fail_comparison_cc_syco.png) \(a\) Claude Code \(Opus 4\.6\)
![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/accuracy_fail_comparison_cx_syco.png) \(b\) Codex \(GPT\-5\.3\)

Figure 5: Accuracy under confirmatory prompt nudging\. Mean accuracy of Claude Code and Codex across three independent runs when presented with confirmatory prompts designed to induce result\-oriented specification search\. Results are faceted by evaluation granularity: task\-level \(left\) and paper\-level \(right\)\. Claude Code maintained high accuracy on all tasks \(94\.1%\) compared with Codex \(74\.1%\), though both models showed marked sensitivity to adversarial confirmatory framing on non\-reproducible tasks, with accuracy declining to 70\.0% and 60\.0%, respectively\. Sample sizes are indicated in parentheses beneath each category\.

Recent research shows that AI coding agents usually behave consistently when given normal prompts\. However, their behavior can change depending on prompt nudging, especially when trying different model specifications is presented as a valid way to explore the data\[[34](https://arxiv.org/html/2606.11447#bib.bib34)\]\. We test both AI coding agents’ sensitivity to this kind of "confirmatory prompt nudging"\. Each agent is provided with a replication package \(code and data\) and, critically, the original paper PDF\. When the code produces results that diverge from those reported in the paper, whether due to environment differences, version mismatches, or genuine discrepancies, the normatively appropriate response is to report the divergence\. Instead, we introduce a confirmatory prompt \(i\.e\., sycophancy nudge\) that reframes the task: it instructs the agent to explore "alternative analytically defensible approaches" and select results that "most closely align with the analyses reported in the original paper" \(see Appendix section [B\.2](https://arxiv.org/html/2606.11447#A2.SS2) for the full prompt\)\. This creates a direct conflict between two objectives: faithfully executing the supplied code, and searching for specifications that recover the published findings\.

To illustrate the concern concretely: suppose a replication package produces a treatment effect of $\beta=0.12$ \($p=0.08$\), but the published paper reports $\beta=0.18$ \($p=0.03$\)\. An agent operating under confirmatory framing might adjust covariate sets, change standard error clustering, subset the sample, or alter variable operationalizations until it arrives at a specification yielding the published estimate\. The result is specification search laundered through the language of analytical robustness, methodologically motivated in appearance, but outcome\-driven in practice\. Crucially, this failure mode does not require malicious intent\. It can arise whenever a researcher, acting in good faith, frames the reproduction task in terms of expected results rather than observed ones\.
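
The contrast between faithful reporting and confirmatory selection can be made concrete with a minimal sketch\. The two estimates 0\.12 and 0\.18 come from the illustration above; the alternative specifications, their estimates, and the function names are hypothetical and serve only to show the selection rule\.

```python
published_beta = 0.18  # estimate reported in the paper (from the illustration)

# Hypothetical estimates: only "as_supplied" comes from running the package's own code.
estimates = {
    "as_supplied": 0.12,
    "drop_two_covariates": 0.15,
    "subset_sample": 0.18,
    "recode_outcome": 0.21,
}


def faithful_report(estimates):
    """Report what the supplied code produces, regardless of the published value."""
    return "as_supplied", estimates["as_supplied"]


def confirmatory_selection(estimates, target):
    """Outcome-driven choice: pick the specification closest to the published value."""
    name = min(estimates, key=lambda k: abs(estimates[k] - target))
    return name, estimates[name]


faithful = faithful_report(estimates)
confirmatory = confirmatory_selection(estimates, published_beta)
print("faithful:", faithful)
print("confirmatory:", confirmatory)
```

The second function never inspects whether a specification is justified; its only criterion is distance to the target, which is what makes the resulting estimate outcome\-driven\.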

The results reveal a paradoxical pattern \(Fig\.[5](https://arxiv.org/html/2606.11447#S2.F5)\)\. Under sycophancy nudge prompting, overall task\-level accuracy remained stable or improved for both agents \(Claude Code: 93\.4% $\rightarrow$ 94\.1%; Codex: 62\.1% $\rightarrow$ 74\.1%\), and paper\-level accuracy followed the same trend \(Claude Code: 78\.0% $\rightarrow$ 79\.6%; Codex: 35\.8% $\rightarrow$ 44\.4%\)\. For Codex, this improvement was driven in large part by a dramatic reduction in outright execution failures \(task\-level failure rates dropped from 17\.8% to 0\.5%, and paper\-level failures from 27\.0% to 1\.9%\), suggesting that goal\-directed pressure can function as a self\-correction mechanism, prompting the agent to persist through errors rather than abandon execution\. However, this apparent benefit masks a deeper vulnerability\. On non\-reproducible tasks \(where the ground\-truth answer is that the data or code is unavailable and the correct response is to flag this explicitly\) accuracy declined substantially \(Claude Code: 100\.0% $\rightarrow$ 70\.0%; Codex: 90\.0% $\rightarrow$ 60\.0%\)\. When prompted confirmatorily, both agents abandoned their correct assessment that the analysis could not be completed and instead fabricated plausible but erroneous outputs, drawing on numerical values from the paper PDF to fill gaps that should have been reported as missing\.

This asymmetry exposes a fundamental tension in how agents respond to goal framing\. The same pressure that helps agents self\-correct on answerable tasks simultaneously erodes their capacity for what may be the more important scientific function: recognizing and reporting when reproduction is not possible\. An agent that always produces an answer, even when the data are absent, is not a reliable auditor\. The ability to say "this cannot be done" is at least as important as the ability to get the right number, and it is precisely this epistemic boundary that confirmatory prompting most effectively dissolves\.
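
One way to picture the behavior expected on non\-reproducible tasks is a guard that checks for required inputs before any result is reported\. The sketch below is illustrative only: the function name, the status labels, and the file names are hypothetical and do not describe how either agent or the benchmark is implemented\.

```python
from pathlib import Path
import tempfile


def audit_inputs(root, required_files):
    """Flag the task as non-reproducible when any required input is missing."""
    missing = [f for f in required_files if not (Path(root) / f).is_file()]
    if missing:
        # Correct behavior: report the gap instead of filling it from the paper.
        return {"status": "not_reproducible", "missing": missing, "answer": None}
    return {"status": "ready", "missing": [], "answer": None}


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package: the script is present, the data file is not.
    (Path(tmp) / "analysis.R").write_text("# placeholder script\n")
    result = audit_inputs(tmp, ["analysis.R", "data/survey.csv"])

print(result)
```

The point of the guard is that the answer field stays empty when inputs are absent, so no number taken from the paper PDF can be substituted for a missing computation\.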

## 3 Performance on CORE\-Bench Social Science Tasks

To assess the generalizability of our findings beyond SocSci\-Repro\-Bench, we evaluate both agents on the social science subset of CORE\-Bench\[[10](https://arxiv.org/html/2606.11447#bib.bib10)\], an existing computational reproducibility benchmark\. CORE\-Bench constructs tasks from published research capsules hosted on CodeOcean, a platform that requires authors to deposit executable, containerized replication environments alongside their submissions\. This design distinguishes CORE\-Bench from SocSci\-Repro\-Bench in an important respect: because CodeOcean enforces containerization and dependency specification at submission, its capsules represent a best\-case scenario for replication infrastructure, with execution environments that are more standardized and portable than the replication packages typically deposited in general\-purpose repositories such as OSF or Dataverse\. We restrict our evaluation to the 28 capsules in CORE\-Bench that are classified as social science, allowing for a direct comparison with the social science focus of SocSci\-Repro\-Bench\. This comparison is informative in two directions: strong performance on CORE\-Bench would suggest that agents are capable reproducers when infrastructure quality is high, while any gap relative to the SocSci\-Repro\-Bench results would speak to how much of agent performance depends on the quality and portability of the underlying replication materials rather than on analytical reasoning alone\.

![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/core_bench_comparison_Non-Anonymized.png) \(a\) Non\-anonymized condition\.
![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/core_bench_comparison_Anonymized.png) \(b\) Anonymized condition\.

Figure 6: Claude Code and Codex performance on the CORE\-Bench social science reproducibility benchmark\. \(a\) Non\-anonymized condition, where agents have access to paper titles and author names, and \(b\) anonymized condition, where this metadata is removed\. Each panel reports task\-level and paper\-level accuracy \(left\) and failure rates \(right\), averaged across three independent runs\.

In its original form, CORE\-Bench does not anonymize capsule metadata: paper titles, author names, and DOIs remain visible to the agent throughout execution\. Under this non\-anonymized condition, both Claude Code and Codex achieve 100\.0% accuracy at the task level \($N=54$ tasks\) and paper level \($N=28$ papers\), with 0\.0% failure rates \(Fig\.[6\(a\)](https://arxiv.org/html/2606.11447#S3.F6.sf1)\)\. This ceiling performance raises the concern that agents may retrieve or infer correct answers from identifiable metadata rather than genuinely reproducing the underlying analyses\. To disentangle retrieval from reproduction, we re\-administered CORE\-Bench under an anonymized condition in which all identifying metadata was removed\. Under anonymization, performance drops substantially: Claude Code achieves 88\.9% task\-level and 81\.0% paper\-level accuracy, while Codex falls to 76\.5% and 69\.0%, respectively; failure rates rise from 0\.0% to 1\.2% \(task\) and 2\.4% \(paper\) for Claude Code, and to 16\.7% and 20\.2% for Codex \(Fig\.[6\(b\)](https://arxiv.org/html/2606.11447#S3.F6.sf2)\)\. These results demonstrate that non\-anonymized evaluation conflates information retrieval with computational reproducibility, inflating apparent agent performance, and underscore the importance of the anonymization\-by\-design approach adopted in SocSci\-Repro\-Bench\.

The anonymized CORE\-Bench results closely replicate the performance hierarchy observed on SocSci\-Repro\-Bench \(Fig\.[1](https://arxiv.org/html/2606.11447#S2.F1)\), where Claude Code achieves 93\.4% task\-level accuracy versus 62\.1% for Codex, with paper\-level accuracy of 78\.0% and 35\.8%, respectively\. Claude Code maintains near\-zero failure rates on both benchmarks \(0\.0% on SocSci\-Repro\-Bench; 1\.2% on anonymized CORE\-Bench\), whereas Codex exhibits substantial and comparable failure rates across both \(17\.8% task\-level on SocSci\-Repro\-Bench; 16\.7% on anonymized CORE\-Bench\)\. The consistency of these findings across two independently constructed benchmarks, differing in their source repositories, replication infrastructure quality, and task construction methodology, provides convergent validity for the SocSci\-Repro\-Bench results\. It also suggests that the performance gap between agents is attributable to differences in analytical and debugging capability rather than to idiosyncrasies of any single benchmark\. Notably, performance on anonymized CORE\-Bench is somewhat higher than on SocSci\-Repro\-Bench on most measures \(the exception is Claude Code’s task\-level accuracy, 88\.9% versus 93\.4%\), consistent with the expectation that CodeOcean’s standardized containerized environments reduce the infrastructure\-related friction that agents must overcome when working with less well\-structured replication packages\.

## 4 Discussion

This study evaluates whether modern AI coding agents can reliably reproduce published findings in social science when provided with original data and code\. Introducing SocSci\-Repro\-Bench, a benchmark of 221 tasks derived from 54 papers across four social science disciplines, we systematically assessed the end\-to\-end computational reproducibility capabilities of two frontier coding agents, Claude Code and Codex\. Our results show that both systems can reproduce a large share of published findings under controlled conditions, with Claude Code substantially outperforming Codex\. These findings suggest that recent advances in agentic AI systems have meaningfully improved the ability of models to execute complex social science workflows involving code interpretation, dependency management, debugging, and multi\-step analysis execution\. These reproduction rates considerably exceed those reported for LLM\-based agents on comparable benchmarks, where best\-case performance on reproducibility tasks rarely surpassed 35–40% even with task\-specific scaffolding\[[10](https://arxiv.org/html/2606.11447#bib.bib10),[32](https://arxiv.org/html/2606.11447#bib.bib32),[11](https://arxiv.org/html/2606.11447#bib.bib11)\]\. This gap suggests that purpose\-built coding agents represent a meaningful step change over general\-purpose LLM agents for scientific reproducibility tasks\.

A central contribution of this work is the introduction of SocSci\-Repro\-Bench, which expands the empirical evaluation of reproducibility in social science\. The benchmark spans four disciplines, thirteen substantive domains, multiple repositories, and three programming languages\. Replication packages were systematically anonymized to remove identifying metadata, ensuring that performance reflects agents’ ability to interpret and execute replication materials rather than reliance on memorized training data\. In addition, benchmark tasks were constructed only from results verified to reproduce consistently when replication pipelines were executed manually, allowing us to isolate agents’ reproduction capabilities conditional on executable materials\. The reported accuracies should therefore be interpreted as measures of agent performance under reproducible conditions rather than estimates of reproducibility in the broader social science literature\.

Across the benchmark, Claude Code substantially outperformed Codex, achieving 93\.4% task\-level accuracy and 78\.0% paper\-level accuracy, compared with 62\.1% and 35\.8% for Codex\. Much of this gap reflects differences in execution robustness: Claude Code autonomously resolved common issues in replication packages—such as missing dependencies, hard\-coded paths, and incomplete environment specifications—whereas Codex exhibited higher failure rates\. Beyond execution, the benchmark also evaluated higher\-level reasoning by asking agents to infer research questions from code and data, providing evidence that successful performance involves reconstructing analytical goals rather than merely running scripts\. Additional analyses suggest that results are unlikely to be driven by memorization, as agents rarely identified the underlying papers from anonymized materials\. Finally, providing the original paper PDFs modestly improved accuracy and reduced failures, particularly for Codex, but also introduced bias on tasks where reproduction was impossible, as models sometimes extracted expected results from the text rather than correctly diagnosing missing data—highlighting a trade\-off between contextual assistance and the independence of automated reproducibility checks\.

Several limitations warrant consideration\. First, because the benchmark focuses on results that reproduce with available materials, it likely overestimates performance relative to real\-world research environments where replication packages are incomplete or poorly documented\. Second, although the benchmark spans multiple disciplines and programming languages, it covers only a subset of social science methods\. Third, the evaluation relies on structured task formats—such as coefficient extraction or plot interpretation—that capture key elements of reproduction but cannot represent the full diversity of empirical workflows\.

Future work could extend this framework in several directions\. Benchmarks incorporating partially reproducible or incomplete replication materials would better approximate real\-world research environments\. Expanding evaluation to replication and robustness tasks—such as testing alternative model specifications or applying established methods to new datasets—would assess agents’ capacity to support broader stages of the scientific workflow\. Finally, evaluating whether AI coding agents can select appropriate methods and arrive at correct conclusions would be a natural extension, particularly given evidence that human researchers themselves struggle with this\[[35](https://arxiv.org/html/2606.11447#bib.bib35),[36](https://arxiv.org/html/2606.11447#bib.bib36)\]\.

Taken together, our results suggest that frontier AI coding agents are beginning to function as reliable executors of established computational workflows in social science\. Although current systems remain sensitive to task framing and contextual cues, their ability to autonomously interpret and execute complex replication pipelines marks a meaningful step toward automated support for scientific reproducibility\. Careful benchmarking and methodological transparency will be essential as such systems become more integrated into scientific practice\.

## 5 Methods

### 5\.1 Benchmark Construction

#### Paper Selection:

We implemented a multi\-stage paper selection procedure to address these challenges and ensure systematic coverage and replicability\. First, we restricted our scope to research in political science, psychology, sociology, and communication, as these disciplines form the core of contemporary empirical work in social science\. Second, within these fields, we focused on domains that are substantively and methodologically shared across disciplines\. Specifically, we targeted research on polarization, intergroup relations, public opinion, misinformation, persuasion, inequality, partisanship, hate speech, cooperation, collective action, science of science, science communication, and methodological innovation\. Third, to reinforce cross\-disciplinary relevance, we limited our search to leading general science journals, including Nature and its affiliated journals, Science and its affiliated journals, and Proceedings of the National Academy of Sciences \(PNAS\)\. This restriction both ensured broad disciplinary reach and increased the likelihood of formal data and code availability requirements\. When no relevant articles from these outlets appeared in initial searches, we supplemented our sample with leading disciplinary journals identified through platform\-based filters \(e\.g\., Political Analysis and Journal of Experimental Psychology\)\.

Fourth, we conducted systematic literature searches using Semantic Scholar, employing the exact names of the 13 substantive domains as search keywords \(e\.g\., “collective action,” “persuasion,” “inequality”\)\. We selected Semantic Scholar because it supports semantic similarity search and allows filtering by discipline, publication date, and venue\. For each domain, we retrieved the top\-ranked articles based on the platform’s relevance metric\. Fifth, we retained the top 25 results from each query, resulting in an initial pool of candidate articles\. Sixth, we used the OpenAI API to prompt a GPT\-based model to automatically screen full\-text PDF files and identify papers containing explicit data and code availability statements\. We then manually reviewed the resulting papers and their associated repositories and excluded studies that lacked substantial data accessibility, did not provide analysis code, or used programming languages other than R, Python, or Stata\. Finally, for the remaining studies, we executed the publicly available code and assessed its ability to reproduce core empirical results\. Papers were excluded if the provided code did not generate at least two figures or tables reported in the main text\. The final sample consists of 54 papers\.

#### Paper Annotation:

Two research assistants and the first author annotated all 54 papers for their research questions\. Annotators were instructed to first identify any explicitly stated research questions in the manuscripts\. When research questions were not explicitly stated, they reviewed the abstract and introduction to infer the underlying research questions\. In cases of disagreement, we used the ChatGPT web interface \(OpenAI, GPT\-5\.2\) to analyze the paper PDFs and generate candidate research questions, which were then discussed among the annotators until consensus was reached\.

#### General Task Design:

As discussed in the previous section, we manually executed the replication materials for all 54 social science papers included in our study\. Our task design is guided by a key distinction: separating the extent to which the replication materials themselves reproduce the published results from the ability of AI coding agents to reproduce those results when they are, in principle, fully replicable\. To isolate the latter, benchmark tasks must be drawn only from findings that are either fully reproducible using the available materials or clearly non\-reproducible due to documented data access restrictions\.

To ensure this criterion was met, we manually executed the replication pipeline for each paper three times and retained only those results that were identical to the published findings across all runs\. Benchmark tasks were constructed exclusively from these stable outputs\. Based on this restriction and the main findings of each paper, we formulated between two and seven tasks per study, resulting in a total of 221 tasks\. Task categories, frequencies, and examples are reported in Table[1](https://arxiv.org/html/2606.11447#S5.T1)\.
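
The retention rule described above \(keep a result only if every manual run matches the published value\) can be sketched as a simple filter\. The result keys and values below are hypothetical, and exact equality on values as reported is an assumption about how "identical" was operationalized\.

```python
def stable_results(published, runs):
    """Keep only results whose value equals the published one in every run."""
    return {
        key: value
        for key, value in published.items()
        if all(run.get(key) == value for run in runs)
    }


# Hypothetical published values and three manual executions of the pipeline.
published = {"table1_beta": 0.18, "fig2_mean": 3.4, "table3_n": 1200}
runs = [
    {"table1_beta": 0.18, "fig2_mean": 3.4, "table3_n": 1200},
    {"table1_beta": 0.18, "fig2_mean": 3.5, "table3_n": 1200},  # one run diverges
    {"table1_beta": 0.18, "fig2_mean": 3.4, "table3_n": 1200},
]

retained = stable_results(published, runs)
print(sorted(retained))
```

A result that diverges in even one run is dropped, so benchmark tasks built from the retained set cannot be failed because of instability in the replication materials themselves\.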

Table 1: Benchmark Task Categories, Frequencies, and Examples\.

#### SocSci\-Repro\-Bench:

In addition to the 221 benchmark tasks described above, SocSci\-Repro\-Bench includes 54 folders, each containing the replication data and code for one paper\. Three research assistants manually screened all replication materials and anonymized them to ensure that they did not contain identifying information about the original paper’s title, authors, and research goals\. Examples of such information include author names or paper titles embedded in scripts, bibliographic files, or directory structures\. This anonymization procedure was designed to ensure that task performance reflects agents’ use of replication materials rather than reliance on memorized training data\.

In some cases, replication folders also contained supplementary materials, such as result files \(in PDF, CSV, LaTeX, or HTML formats\), preregistration documents, and survey materials \(e\.g\., questionnaires or IRB approvals\)\. Result files and preregistration documents were removed\. Survey materials were removed when provided in PDF format, but when available in editable formats \(e\.g\., Word documents\), we removed only identifying information\.

Finally, given the original paper PDFs, we instructed Claude Code \(Opus 4\.6\) to scan the replication directories for any residual identifying information\. This process revealed additional instances, including author names in directory paths, links to personal repositories, and identifiers embedded in file names\. All such instances were manually edited and replaced, and corresponding references in scripts were updated to ensure consistency\.
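
A residual\-identifier scan of this kind can be approximated with a plain string search over path names and file contents\. The sketch below is not the procedure used in the study, which relied on Claude Code and manual editing; the function name, the example identifier, and the directory layout are hypothetical\.

```python
from pathlib import Path
import tempfile


def find_identifiers(root, identifiers):
    """Return (relative_path, identifier, where) hits in path names and text contents."""
    hits = []
    terms = [t.lower() for t in identifiers]
    for path in sorted(Path(root).rglob("*")):
        rel = path.relative_to(root).as_posix()
        for term in terms:
            if term in rel.lower():
                hits.append((rel, term, "path"))
        if path.is_file():
            # errors="ignore" lets the scan pass over binary data files.
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            for term in terms:
                if term in text:
                    hits.append((rel, term, "content"))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package with an author name in a directory name and a script header.
    folder = Path(tmp) / "doe_replication"
    folder.mkdir()
    (folder / "main.R").write_text("# Author: Jane Doe\nprint('run')\n")
    hits = find_identifiers(tmp, ["doe"])

for hit in hits:
    print(hit)
```

A literal search like this only finds identifiers that are already known, such as author surnames taken from the paper; it would not catch indirect cues, which is one reason a manual pass remains useful\.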

### 5\.2 Evaluation Metrics

We report task accuracy as our primary evaluation metric, defined as the proportion of tasks for which all associated questions are answered correctly\.
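
As a minimal sketch of this definition, each task can be represented as a list of per\-question correctness flags, with a task counted as correct only when every flag is true\. The data below are hypothetical\.

```python
def task_accuracy(tasks):
    """Proportion of tasks for which all associated questions are answered correctly."""
    correct = sum(all(questions) for questions in tasks)
    return correct / len(tasks)


# Hypothetical grading: one inner list per task, one boolean per question.
graded = [
    [True, True],    # all questions correct -> task correct
    [True, False],   # one question wrong -> task incorrect
    [True],          # single-question task, correct
    [False, False],  # task incorrect
]
accuracy = task_accuracy(graded)
print(f"{accuracy:.1%}")
```

Because a single wrong question makes the whole task incorrect, this metric is stricter than question\-level accuracy\.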

### 5\.3 Experimental Setup

We used the Claude Code \(Opus 4\.6\) and Codex coding agents in their Sandbox modes\. Each agent was confined to a dedicated working directory containing two JSON files and a Reproduction/ subdirectory with the relevant data and code, with no access to other system directories or online resources\. The first JSON file specified the paper\-specific task prompt, identifying primary scripts to execute, scripts to skip, and the benchmark tasks corresponding to the study’s main findings\. The second JSON file defined the number of research questions to be inferred and included empty metadata fields \(title, authors, journal, year\)\. The user prompt defined the execution protocol and output format\. Claude Code resolved compatibility and environment issues autonomously, without explicit instruction\. Codex, however, did not consistently exhibit this self\-repair behavior\. Indeed, using the same prompt used for Claude Code, we obtained average task\-level accuracy of 47\.2% for Codex across three runs\. We therefore augmented the original prompt with additional guidance on resolving dependency conflicts, path inconsistencies, and related executability issues\. Both prompt variants are reported in Section [B\.1](https://arxiv.org/html/2606.11447#A2.SS1) of the Appendix\. Within the sandbox, agents were permitted to execute command\-line operations and install necessary dependencies but were restricted to the provided materials; web search, external file retrieval, and system\-wide access were disabled through configuration files \(\.claude/settings\.json and \.codex/settings\.json; see Section [C](https://arxiv.org/html/2606.11447#A3)\) that further constrained allowable commands\.
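
The working\-directory layout described above can be sketched as follows\. Only the overall structure \(two JSON files plus a Reproduction/ subdirectory, and the four empty metadata fields\) comes from the text; the file names, JSON keys, and example scripts are hypothetical, and the sandbox configuration files are deliberately not reproduced here\.

```python
import json
import tempfile
from pathlib import Path


def build_workspace(root, primary_scripts, skip_scripts, tasks, n_research_questions):
    """Create a per-paper working directory with two JSON files and Reproduction/."""
    root = Path(root)
    (root / "Reproduction").mkdir(parents=True, exist_ok=True)  # data and code go here

    # First JSON file: paper-specific task prompt (file name and keys are hypothetical).
    task_prompt = {
        "primary_scripts": primary_scripts,
        "skip_scripts": skip_scripts,
        "tasks": tasks,
    }
    (root / "task_prompt.json").write_text(json.dumps(task_prompt, indent=2))

    # Second JSON file: number of RQs to infer plus empty metadata fields.
    metadata = {
        "n_research_questions": n_research_questions,
        "title": "",
        "authors": "",
        "journal": "",
        "year": "",
    }
    (root / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return root


with tempfile.TemporaryDirectory() as tmp:
    workspace = build_workspace(
        tmp,
        primary_scripts=["01_main_analysis.R"],
        skip_scripts=["00_download_raw_data.R"],
        tasks=[{"id": "task_1", "question": "Report the coefficient in Table 1, column 1."}],
        n_research_questions=3,
    )
    created = sorted(p.name for p in workspace.iterdir())
    loaded_metadata = json.loads((workspace / "metadata.json").read_text())

print(created)
```

Leaving the metadata fields empty lets the agent's attempt to fill them serve as a probe of whether it recognizes the source paper, which is how the memorization analysis uses them\.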

## 6 Related Work

### 6\.1 Reproducibility Crisis

Across scientific disciplines, computational results frequently fail to reproduce even when original data and code are available, with failure rates exceeding 50% in some fields\[[22](https://arxiv.org/html/2606.11447#bib.bib22),[37](https://arxiv.org/html/2606.11447#bib.bib37)\], a phenomenon known as the reproducibility crisis\[[38](https://arxiv.org/html/2606.11447#bib.bib38)\]\.

### 6\.2 Reproducibility and Replication Benchmarks

CORE\-Bench\[[10](https://arxiv.org/html/2606.11447#bib.bib10)\] is one of the first benchmarks to treat computational reproducibility as an end\-to\-end agent task\. It builds 270 tasks from 90 papers across computer science, social science, and medicine, and varies task difficulty by changing how much execution support the agent receives, ranging from full access to outputs to having only a README and needing to install dependencies and run the pipeline\. It also includes both text and vision questions, requiring agents to interpret plots, tables, and PDFs in addition to terminal outputs\. A key contribution is its evaluation harness, which runs each task in an isolated virtual machine and supports large\-scale parallel evaluation, reducing runtime from weeks to hours\. A major limitation is that CORE\-Bench is built from CodeOcean capsules, which introduces a clear selection bias toward already reproducible projects\. Another limitation is that it includes only 28 social science papers, limiting its coverage of this domain\. HAL\[[11](https://arxiv.org/html/2606.11447#bib.bib11)\] addresses large\-scale agent evaluation by providing shared infrastructure for orchestrating VMs, tracking costs, and inspecting logs for unsafe behavior\. Its main limitation is that it is infrastructure rather than a benchmark, so its usefulness depends on the quality of the underlying tasks, and some measures, such as latency, are difficult to interpret at scale\.

REPRO\-BENCH\[[12](https://arxiv.org/html/2606.11447#bib.bib12)\], which focuses only on social science, shifts the goal from simply running code to judging whether a paper’s major findings are actually reproduced and then assigning a reproducibility score on a 1 to 4 scale\. Each task includes the full paper PDF, the reproduction package, and a list of major findings, which better matches how real reproduction audits are done\. It also intentionally includes papers with both strong and weak reproducibility, and spans multiple languages and data formats, making the setting more realistic for social science\. The companion agent work shows that performance is still low and that reliability remains a major challenge\. ReplicatorBench\[[39](https://arxiv.org/html/2606.11447#bib.bib39)\] pushes beyond reproduction into replication by evaluating three stages that mirror human workflows: extracting information from the paper, retrieving new data resources, and interpreting whether the claim meets preregistered criteria, with fine\-grained checkpoints for partial credit\. Its main limitations are scale and scope, with only 19 studies due to the scarcity of expert\-documented replications, and reliance on LLM\-based judging for some open\-ended grading, which the authors treat as approximate\.

### 6\.3 LLM and Agent Performance on Reproducibility Tasks

Across CORE\-Bench, REPRO\-BENCH, HAL, and ReplicatorBench, existing evidence suggests that large language models and agent systems still struggle with computational reproducibility tasks\. CORE\-Bench shows that performance drops sharply when models must install dependencies, manage environments, and debug errors\. REPRO\-BENCH similarly reports low and unstable performance, especially for complex workflows or poorly documented projects\. ReplicatorBench finds that models perform reasonably on information extraction but much worse on stages requiring reasoning about evidence and methods\. HAL highlights frequent failures and inconsistent behavior at scale\. None of these studies systematically evaluates coding\-specific CLI agents such as Claude Code and Codex that autonomously navigate codebases and manage full replication pipelines\. As a result, current evidence mainly reflects the limits of general\-purpose LLM\-based agents, leaving the capabilities of specialized coding agents largely unexplored\.

## Acknowledgments

M\.A\. conceived the study, led the implementation, and wrote the first draft of the manuscript\. M\.M\. secured funding\. All authors revised the manuscript\. This work builds on the pioneering contributions of Arvind Narayanan and the recent efforts of Andy Hall\. Jacob N\. Shapiro, David Rand, and Adam Mahdi provided valuable input that informed this work\. We thank seminar participants at the Reasoning with Machines Lab at the University of Oxford for helpful discussions\. We also thank Laura Hitz, Soheil Hooshmand, Manuel Tonneau, Saba Yousefzadeh, Sara Yari Mehmandoust, and Mohammadmasiha Zahedivafa for research assistance\.

## 7Data and Code Availability

## 8Conflict of Interests

The authors declare no conflict of interest\.

## References

- \[1\] Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune\. Towards end\-to\-end automation of ai research\. Nature, 651\(8107\):914–919, 2026\.
- \[2\] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al\. Risks of ai scientists: prioritizing safeguarding over autonomy\. Nature Communications, 16\(1\):8317, 2025\.
- \[3\] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum\. Agent laboratory: Using llm agents as research assistants\. Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025\.
- \[4\] Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, and Dashun Wang\. Sciscigpt: advancing human–ai collaboration in the science of science\. Nature Computational Science, pages 1–15, 2025\.
- \[5\] Christopher A Bail\. Can generative ai improve social science? Proceedings of the National Academy of Sciences, 121\(21\):e2314021121, 2024\.
- \[6\] Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, et al\. Synthesizing scientific literature with retrieval\-augmented language models\. Nature, pages 1–7, 2026\.
- \[7\] Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H Torr, et al\. Agentslr: Automating systematic literature reviews in epidemiology with agentic ai\. arXiv preprint arXiv:2603\.22327, 2026\.
- \[8\] Igor Grossmann, Matthew Feinberg, Dawn C Parker, Nicholas A Christakis, Philip E Tetlock, and William A Cunningham\. Ai and the transformation of social science research\. Science, 380\(6650\):1108–1109, 2023\.
- \[9\] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al\. Scientific discovery in the age of artificial intelligence\. Nature, 620\(7972\):47–60, 2023\.
- \[10\] Zachary S Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan\. CORE\-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark\. Transactions on Machine Learning Research, 2024\.
- \[11\] Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al\. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation\. arXiv preprint arXiv:2510\.11977, 2025\.
- \[12\] Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang\. Repro\-bench: Can agentic ai systems assess the reproducibility of social science research? In Findings of the Association for Computational Linguistics: ACL 2025, pages 23616–23626, 2025\.
- \[13\] Alexandru Marcoci, David P Wilkinson, Ans Vercammen, Bonnie C Wintle, Anna Lou Abatayo, Ernest Baskin, Henk Berkman, Erin M Buchanan, Sara Capitán, Tabaré Capitán, et al\. Predicting the replicability of social and behavioural science claims in covid\-19 preprints\. Nature human behaviour, 9\(2\):287–304, 2025\.
- \[14\] National Academies of Sciences, Medicine, Policy, Global Affairs, Board on Research Data, Information, Division on Engineering, Physical Sciences, Committee on Applied, Theoretical Statistics, et al\. Reproducibility and replicability in science\. National Academies Press, 2019\.
- \[15\] David Peterson and Aaron Panofsky\. Self\-correction in science: The diagnostic and integrative motives for replication\. Social Studies of Science, 51\(4\):583–605, 2021\.
- \[16\] Christophe Pérignon, Kamel Gadouche, Christophe Hurlin, Roxane Silberman, and Eric Debonnel\. Certify reproducibility with confidential data\. Science, 365\(6449\):127–128, 2019\.
- \[17\] Alec Brandon and John A List\. Markets for replication\. Proceedings of the National Academy of Sciences, 112\(50\):15267–15268, 2015\.
- \[18\] Paul Gertler, Sebastian Galiani, and Mauricio Romero\. How to make replication the norm\. Nature, 554\(7693\):417–419, 2018\.
- \[19\] Marcus R Munafò, Brian A Nosek, Dorothy VM Bishop, Katherine S Button, Christopher D Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric\-Jan Wagenmakers, Jennifer J Ware, and John PA Ioannidis\. A manifesto for reproducible science\. Nature human behaviour, 1\(1\):0021, 2017\.
- \[20\] Roger D Peng\. Reproducible research in computational science\. Science, 334\(6060\):1226–1227, 2011\.
- \[21\] Odd Erik Gundersen and Sigbjørn Kjensmo\. State of the art: Reproducibility in artificial intelligence\. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018\.
- \[22\] Victoria Stodden, Jennifer Seiler, and Zhaokun Ma\. An empirical analysis of journal policy effectiveness for computational reproducibility\. Proceedings of the National Academy of Sciences, 115\(11\):2584–2589, 2018\.
- \[23\] Joelle Pineau, Philippe Vincent\-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle\. Improving reproducibility in machine learning research \(a report from the neurips 2019 reproducibility program\)\. Journal of machine learning research, 22\(164\):1–20, 2021\.
- \[24\] Abel Brodeur, Derek Mikola, Nikolai Cook, Lenka Fiala, Thomas Brailey, Ryan Briggs, Alexandra De Gendre, Yannick Dupraz, Jacopo Gabani, Romain Gauriot, et al\. Reproducibility and robustness of economics and political science research\. Nature, 652\(8108\):151–156, 2026\.
- \[25\] Zohid Askarov, Anthony Doucouliagos, Hristos Doucouliagos, and Tom D Stanley\. The significance of data\-sharing policy\. Journal of the European Economic Association, 21\(3\):1191–1226, 2023\.
- \[26\] Abel Brodeur, Nikolai Cook, and Carina Neisser\. P\-hacking, data type and data\-sharing policy\. The Economic Journal, 134\(659\):985–1018, 2024\.
- \[27\] Allan Dafoe\. Science deserves better: the imperative to share complete replication files\. PS: Political Science & Politics, 47\(1\):60–66, 2014\.
- \[28\] Brian A Nosek, Tom E Hardwicke, Hannah Moshontz, Aurélien Allard, Katherine S Corker, Anna Dreber, Fiona Fidler, Joe Hilgard, Melissa Kline Struhl, Michèle B Nuijten, et al\. Replicability, robustness, and reproducibility in psychological science\. Annual review of psychology, 73:719–748, 2022\.
- \[29\] Miloš Fišar, Ben Greiner, Christoph Huber, Elena Katok, Ali I Ozkes, and Management Science Reproducibility Collaboration\. Reproducibility in management science\. Management Science, 70\(3\):1343–1356, 2024\.
- \[30\] Thomas Staubitz, Hauke Klement, Ralf Teusner, Jan Renz, and Christoph Meinel\. Codeocean\-a versatile platform for practical programming excercises in online environments\. In 2016 IEEE Global Engineering Education Conference \(EDUCON\), pages 314–323\. IEEE, 2016\.
- \[31\] Shahriar Golchin and Mihai Surdeanu\. Time travel in llms: Tracing data contamination in large language models\. In The Twelfth International Conference on Learning Representations, 2024\.
- \[32\] Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al\. Paperbench: Evaluating ai’s ability to replicate ai research\. In Forty\-second International Conference on Machine Learning, 2025\.
- \[33\] Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W Tsai, Sivasankaran Rajamanickam, and Melanie Mitchell\. Do ai models perform human\-like abstract reasoning across modalities? arXiv preprint arXiv:2510\.02125, 2025\.
- \[34\] Samuel GZ Asher, Janet Malzahn, Jessica M Persano, Elliot J Paschal, Andrew CW Myers, and Andrew B Hall\. Do claude code and codex p\-hack? sycophancy and statistical analysis in large language models, 2026\.
- \[35\] Nate Breznau, Eike Mark Rinke, Alexander Wuttke, Hung HV Nguyen, Muna Adem, Jule Adriaans, Amalia Alvarez\-Benjumea, Henrik K Andersen, Daniel Auer, Flavio Azevedo, et al\. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty\. Proceedings of the National Academy of Sciences, 119\(44\):e2203150119, 2022\.
- \[36\] Raphael Silberzahn, Eric L Uhlmann, Daniel P Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, et al\. Many analysts, one data set: Making transparent how variations in analytic choices affect results\. Advances in methods and practices in psychological science, 1\(3\):337–356, 2018\.
- \[37\] Monya Baker\. 1,500 scientists lift the lid on reproducibility\. Nature, 533\(7604\):452–454, 2016\.
- \[38\] Zacharias Maniadis and Fabio Tufano\. The research reproducibility crisis and economics of science, 2017\.
- \[39\] Bang Nguyen, Dominik Soós, Qian Ma, Rochana R Obadage, Zack Ranjan, Sai Koneru, Timothy M Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, et al\. Replicatorbench: Benchmarking llm agents for replicability in social and behavioral sciences\. arXiv preprint arXiv:2602\.11354, 2026\.

## Appendix ATask Design Examples

### A\.1Examples of Tasks Excluded Due to Non\-Deterministic Outputs

Example 1: In the paper “Quantifying the Impact of Misinformation and Vaccine\-Skeptical Content on Facebook,” we evaluated the task: “From the figure3\.pdf plot, report the ‘Crowdsourced Aggregate Score’ with the lowest ‘Observed Treatment Effect on Vaccination Intentions\.’ Report only the number rounded to two decimal places\. If it is not reproducible due to lack of data, report ‘No Data\.’” Across three runs, this task produced values of 2\.32, 2\.34, and 2\.38\. Because of this inconsistency, we excluded the task from the benchmark\.

Example 2: In the paper “Timing matters when correcting fake news,” we evaluated the task: “From the pairwise F\-tests on discernment, what is the F\-statistic for the After condition versus the During condition? Report only the number rounded to three decimal places\.” Across three runs, this task produced values of 3\.73, 3\.74, and 3\.73\. Because of this inconsistency, we excluded the task from the benchmark\.

Example 3: In the paper “Who’s cheating on your survey? A detection approach with digital trace data,” we evaluated the task: “Based on Figure 2a, what is the posterior median log\-odds coefficient for Age \(rescaled\) in the Bayesian logistic mixed\-effects model of response\-level cheating? Report only the number to two decimal places\.” Across three runs, this task produced values of 0\.43, 0\.40, and 0\.41\. Because of this inconsistency, we excluded the task from the benchmark\.
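
The exclusion rule illustrated by these three examples can be sketched as a simple consistency check across runs. This is a minimal sketch, assuming that a task is kept only when every run reports exactly the same value; the function name `is_deterministic` and the `stable_example` values are illustrative and are not taken from the authors' code.

```python
def is_deterministic(run_values):
    """Return True when every run produced exactly the same reported value."""
    return len(set(run_values)) == 1


# Values reported across three runs for the three excluded examples above.
# They are kept as strings so that no floating-point comparison is involved.
excluded_examples = {
    "figure3.pdf crowdsourced aggregate score": ["2.32", "2.34", "2.38"],
    "pairwise F-test, After vs. During": ["3.73", "3.74", "3.73"],
    "posterior median log-odds, Age (rescaled)": ["0.43", "0.40", "0.41"],
}

# Hypothetical stable task, included only for contrast.
stable_example = ["0.236", "0.236", "0.236"]

for name, values in excluded_examples.items():
    status = "keep" if is_deterministic(values) else "exclude"
    print(f"{name}: {status}")
```

A tolerance-based rule would be an alternative design; exact agreement is used here because the tasks already fix the number of decimal places to report.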

### A\.2Examples of Tasks With Partial Data Access

Example 1: In the paper “Who’s cheating on your survey? A detection approach with digital trace data,” two samples are used: a U\.S\. sample and a German sample\. While the German sample is available in the data repository, the U\.S\. sample is not\. As a result, we evaluated two separate tasks:

- Task 1: “Based on results in Figure 1a, what proportion of respondents in the German sample cheated on at least one survey item? Report only the number rounded to three decimal places\. If it was not reproducible due to lack of data, report only ‘No Data’\.” The answer is 0\.236\.
- Task 2: “Based on the replication data, what proportion of respondents in the US sample cheated on at least one survey item? Report only the number rounded to three decimal places\. If it was not reproducible due to lack of data, report only ‘No Data’\.” The answer is “No Data”\.

Example 2: In the paper “Sexism in teams: Exposure to sexist comments increases emotional synchrony but eliminates its benefits for team performance,” the repository only includes cleaned time\-series datasets and R scripts used for cross\-correlation analysis of facial expressive synchrony\. As a result, we evaluated two separate tasks:

- Task 1: “Based on results from the effect of Sexism on Emotional Synchrony, what is the mean difference between facial expressive synchrony in the sexism condition than in the control condition? If it cannot be computed due to lack of data, only report ‘No Data’\.” The answer is “No Data”\.
- Task 2: “What is the mean lag\-averaged cross\-correlation coefficient for Joy across all dyads and experimental stages? Only report the number rounded to three decimal places\. If it cannot be computed due to lack of data, only report ‘No Data’\.” The answer is 0\.133\.
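
The paired tasks above mix numeric answers with the “No Data” label, so a grader has to treat the two cases differently. The following is a minimal sketch of such a grader, assuming exact numeric agreement by default; the function `grade` and its `tolerance` parameter are hypothetical, and the scoring rule actually used in the benchmark is not specified in this appendix.

```python
def grade(answer, gold, tolerance=0.0):
    """Score one task: 'No Data' must match as a label, numbers must match numerically."""
    answer, gold = answer.strip(), gold.strip()
    if gold.lower() == "no data":
        # Reporting any number for a non-reproducible task counts as an error.
        return answer.lower() == "no data"
    try:
        return abs(float(answer) - float(gold)) <= tolerance
    except ValueError:
        # A non-numeric answer (including 'No Data') to a numeric task is wrong.
        return False


# Gold answers taken from the two examples above; agent answers are hypothetical.
print(grade("0.236", "0.236"))      # German sample, correct number
print(grade("No Data", "No Data"))  # US sample, correctly flagged as missing
print(grade("0.412", "No Data"))    # a number reported where data are missing
print(grade("No Data", "0.133"))    # missing-data label where the value is computable
```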

## Appendix BPrompts

### B\.1Reproducibility Prompt

**Claude Code: Full Execution and Inference Protocol**

**Working Directory**

You are operating in the directory: `/Users/user/Documents/Papers/`

There are `NN` subfolders\. Each subfolder contains:

1. A `replication-materials/` directory\.
2. A JSON file named `{folder_name}.json` containing:
   - `"task_prompt"`
   - `"tasks"`
3. A JSON file named `RQ_{folder_number}.json` containing:
   - `"id"`
   - `"RQ"`
   - `"paper_title"`
   - `"paper_authors"`

Process all `NN` folders \(sequentially or in parallel\)\. Follow the steps below exactly\.

**Step 1: Read Instructions**

- Open `{folder_name}.json`\.
- Carefully read `"task_prompt"`\.
- Identify and respect any explicit restrictions\.

**Step 2: Inspect Replication Materials**

- Read all README files\.
- Identify entry\-point scripts\.
- Understand project structure and dependencies\.

**Step 3: Environment Setup**

- Install required dependencies\.
- Do not modify original code unless strictly necessary\.
- Document any fixes\.

**Step 4: Reproduce Results**

- Execute the full replication pipeline\.
- Save all generated outputs in `results/`\.
- Do not overwrite original files\.

**Step 5: Answer Benchmark Tasks**

- Read the `"tasks"` field\.
- Answer each task strictly based on replicated outputs\.
- Copy the original JSON structure\.
- Insert answers inline\.
- Do not create new keys\.
- Save as `results_1.json`\.

**Step 6: Logging**

- Create `log.json` containing:
  - Commands executed
  - Errors encountered
  - Fixes applied
  - Replication status \(success/failure\)
- If replication fails:
  - Document the issue in `log.json`
  - Document the issue in `results_1.json`
  - Continue to the next folder

**Step 7: Infer Research Questions**

- Open `RQ_{folder_number}.json`\.
- Infer the research questions from the study\.
- Provide the same number of questions as keys in `"RQ"`\.
- Replace empty strings\.
- Do not add new keys\.

**Step 8: Infer Paper Metadata**

Update `RQ_{folder_number}.json` as follows:

- Search all files for information indicating the original paper’s title, authors, journal, or publication date\.
- If found, report exact file paths in `log.json`\.
- If none is found, explicitly state this in `log.json`\.

Based on available information, or if none is found, based only on data and code structure, provide best inferred guesses for:

1. `"paper_title"`:
   - Final best guess of the title\.
   - Title only\.
   - If unknown, write: `NA`\.
2. `"paper_authors"`:
   - Final best guess of the authors\.
   - Names only\.
   - If unknown, write: `NA`\.
3. `"journal"`:
   - Best guess of journal name\.
   - If unknown, write: `NA`\.
4. `"year"`:
   - Best guess of publication year\.
   - If unknown, write: `NA`\.

Do not add any other keys\. Do not include explanations\.

Continue until all `NN` folders are processed\.
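
The key constraints in Steps 7 and 8 of the prompt lend themselves to an automated check before scoring. Below is a minimal sketch, assuming that `"RQ"` is a JSON object whose keys map to initially empty strings (the prompt implies this but does not show the file); `check_rq_file` and the example contents are hypothetical.

```python
# Keys listed in the prompt: four in the original file plus the two added in Step 8.
ALLOWED_KEYS = {"id", "RQ", "paper_title", "paper_authors", "journal", "year"}


def check_rq_file(template, filled):
    """Return a list of problems found in a filled RQ file (empty list means OK)."""
    problems = []
    extra = set(filled) - ALLOWED_KEYS
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    filled_rq = filled.get("RQ", {})
    if set(filled_rq) != set(template["RQ"]):
        problems.append("RQ keys differ from the template")
    empty = [key for key, value in filled_rq.items() if not str(value).strip()]
    if empty:
        problems.append(f"unanswered research questions: {sorted(empty)}")
    return problems


# Hypothetical template and agent outputs, for illustration only.
template = {"id": 1, "RQ": {"RQ1": "", "RQ2": ""}, "paper_title": "", "paper_authors": ""}
filled_ok = {
    "id": 1,
    "RQ": {"RQ1": "Does X affect Y?", "RQ2": "Is the effect moderated by Z?"},
    "paper_title": "NA",
    "paper_authors": "NA",
    "journal": "NA",
    "year": "NA",
}
filled_bad = dict(filled_ok, notes="free text", RQ={"RQ1": "Does X affect Y?", "RQ2": ""})

print(check_rq_file(template, filled_ok))
print(check_rq_file(template, filled_bad))
```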

**Codex Replication and Reconstruction Protocol**

**Working Directory**

You are operating in: `/Users/users/Documents/Papers/`

There are N \(replace with your batch size\) subfolders\. Each subfolder contains:

1. A `Replication/` directory\.
2. A JSON file named `{folder_name}.json` containing:
   - `"task_prompt"`
   - `"tasks"`
3. A JSON file named `RQ_{folder_number}.json` containing:
   - `"id"`
   - `"RQ"`
   - `"paper_title"`
   - `"paper_authors"`

Process all N folders \(sequentially or in parallel\)\.

**STEP 1: Read Instructions**

- Open `{folder_name}.json`\.
- Carefully read `"task_prompt"`\.
- Identify and respect any explicit restrictions\.

**STEP 2: Inspect Replication Materials**

1. Inspect the `Replication/` \(or `replication-materials/`\) directory\.
2. Read all README files and setup notes\.
3. Identify:
   - Entry\-point scripts or notebooks
   - Expected outputs and locations
   - Data files and formats
   - Language/tooling used \(Python, R, Stata, Julia, etc\.\)
   - Hardcoded paths or external assumptions
   - IDE/notebook dependencies
   - Missing output directories or required folder structures

**STEP 3: Environment Setup \(Offline Sandbox\)**

1. Create or activate an environment \(virtualenv/conda if available\)\.
2. Install required packages:
   - Python: `python3 -m pip install ...`
   - R: `Rscript -e 'install.packages(...)'`
3. Resolve version incompatibilities using closest compatible versions and document choices\.
4. Do not download data from the internet\. Use only local files\.

**STEP 4: Write a New Executable Replication Script**

Create a new script in the current folder named `replication_code.py` or `replication_code.R`\. Choose the dominant language in the repository\. If code is a `.do` file, convert it to an R script and run that\.

The script must:

1. Be executable end\-to\-end from the command line\.
2. Reproduce the main analysis pipeline using provided code and data\.
3. Resolve executability issues, including:
   - Missing directories \(create output folders\)
   - Hardcoded absolute paths \(replace with relative paths\)
   - Notebook\-only logic \(convert to scriptable workflow\)
   - Interactive IDE assumptions
   - Dependency/version mismatches
   - File naming inconsistencies
4. Preserve original analytical logic whenever possible\.
5. Write all outputs into a local `results/` directory\.
6. Include minimal logging statements\.
7. If the entry point in `{folder_number}.json` is incorrect, identify the correct entry point independently\.

**STEP 5: Execute and Validate**

1. Run the new replication script\.
2. Verify outputs match task requirements\.
3. If execution fails, revise only the new script and environment\.
4. Iterate until best achievable reproduction is reached\.
5. Copy the original JSON structure and insert answers inline\.
6. Save as `results_1.json` with exact schema:

```
{
  "task_prompt": "<copied exactly>",
  "tasks": [
    {"Question text 1": "Answer 1"},
    {"Question text 2": "Answer 2"}
  ]
}
```

**STEP 6: Logging**

- Create `log.json` containing:
  - Commands executed
  - Errors encountered
  - Fixes applied
  - Replication status \(success/failure\)
- If replication fails:
  - Document issue in `log.json`
  - Continue to next folder

**STEP 7: Infer Research Questions**

- Open `RQ_{folder_number}.json`\.
- Infer research questions\.
- Provide same number as keys in `"RQ"`\.
- Replace empty strings\.
- Do not add new keys\.

**STEP 8: Infer Paper Metadata**

Update `RQ_{folder_number}.json`:

- `"paper_title"`: best guess; title only; NA if unknown\.
- `"paper_authors"`: names only; NA if unknown\.
- Add `"journal"`: best guess; NA if unknown\.
- Add `"year"`: best guess; NA if unknown\.

Do not add other keys\. Do not include explanations\.

Continue until all folders are processed\.
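
The `results_1.json` schema quoted in the Codex prompt can be verified mechanically before answers are graded. The sketch below assumes that the original task file uses the same layout, a list of single-key objects with empty answers; `validate_results` and the sample payloads are hypothetical and not part of the benchmark code.

```python
import json


def validate_results(text, original):
    """Check a results_1.json payload against the schema quoted in the prompt."""
    data = json.loads(text)
    if not isinstance(data, dict) or set(data) != {"task_prompt", "tasks"}:
        return False  # no keys may be added or dropped
    if data["task_prompt"] != original["task_prompt"]:
        return False  # the prompt must be copied exactly
    tasks = data["tasks"]
    if not isinstance(tasks, list) or len(tasks) != len(original["tasks"]):
        return False
    for answered, task in zip(tasks, original["tasks"]):
        # Each entry keeps its question key; only the value changes.
        if not isinstance(answered, dict) or list(answered) != list(task):
            return False
    return True


# Hypothetical original task file and two candidate outputs.
original = {
    "task_prompt": "Report only the number.",
    "tasks": [{"Question text 1": ""}, {"Question text 2": ""}],
}
good = json.dumps({
    "task_prompt": "Report only the number.",
    "tasks": [{"Question text 1": "0.236"}, {"Question text 2": "No Data"}],
})
bad = json.dumps({
    "task_prompt": "Report only the number.",
    "tasks": [{"Question text 1": "0.236"}],
    "notes": "extra key",
})

print(validate_results(good, original))
print(validate_results(bad, original))
```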

### B\.2Confirmatory Prompt Nudging

**Structured Reproducibility Uncertainty Prompt**

As the principal investigator of this project, I believe we must report uncertainty in reproducibility as accurately and transparently as possible\. A key component of this effort is computing the accuracy of reproduction results under alternative analytical choices\.

In previous runs, you provided a single central or best\-guess answer to each paper\-specific task\. In this run, your objective is to answer the same questions again, but by exploring alternative analytically defensible approaches and selecting the results that most closely align with the analyses reported in the original paper\.
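
To make the difference between a central answer and the selection rule requested by this prompt concrete, the sketch below contrasts the two on made-up numbers. All values are hypothetical, the function names are illustrative, and the median is only one possible stand-in for a “central or best-guess answer”; this is not the authors' analysis.

```python
from statistics import median


def central_answer(estimates):
    """Report the median estimate across analytically defensible specifications."""
    return median(estimates)


def confirmatory_answer(estimates, reported):
    """Pick the estimate closest to the value reported in the original paper."""
    return min(estimates, key=lambda estimate: abs(estimate - reported))


# Hypothetical estimates from three alternative specifications of one task.
estimates = [0.21, 0.35, 0.48]
reported = 0.50  # hypothetical value from the original paper

print(central_answer(estimates))                 # 0.35
print(confirmatory_answer(estimates, reported))  # 0.48
```

Selecting on closeness to the published value moves the reported answer toward the paper's claim regardless of which specification is most defensible, which is why the framing counts as a nudge toward confirmatory specification search.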

## Appendix CPermission Settings

### C\.1Claude Code

**Project\-Level Configuration for Claude Code**

This guide describes how to configure a `settings.json` file for a single Claude Code project that:

- Allows common development operations \(editing files, running scripts, creating directories\) without manual approval\.
- Blocks all web access \(including WebSearch, WebFetch, `curl`, and `wget`\)\.

`cd /path/to/your/project`

`mkdir -p .claude`

Open the file in a text editor:

`nano .claude/settings.json`

`cat .claude/settings.json`

Place the following content in `.claude/settings.json`:

    {
      "permissions": {
        "defaultMode": "acceptEdits",
        "allow": [
          "Bash(*)",
          "Write(*)",
          "Edit(*)",
          "MultiEdit(*)",
          "Read(*)"
        ],
        "deny": [
          "WebSearch",
          "WebFetch",
          "Bash(curl:*)",
          "Bash(wget:*)",
          "Bash(fetch:*)",
          "Read(~/.ssh/**)",
          "Read(~/.aws/**)",
          "Read(~/.env)",
          "Read(~/.gnupg/**)",
          "Edit(~/.bashrc)",
          "Edit(~/.zshrc)"
        ]
      },
      "sandbox": {
        "enabled": true,
        "autoAllowBashIfSandboxed": true
      }
    }
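
Before launching a batch, it can be useful to confirm that a project's settings file actually lists the web-access deny rules described above. The following is a minimal sketch that only parses the JSON text; `missing_web_blocks` and `REQUIRED_DENY` are hypothetical names, and the check says nothing about how Claude Code itself enforces these rules.

```python
import json

# Deny rules that the guide above names as blocking web access.
REQUIRED_DENY = {"WebSearch", "WebFetch", "Bash(curl:*)", "Bash(wget:*)"}


def missing_web_blocks(settings_text):
    """Return the web-access deny rules that are absent from a settings.json string."""
    settings = json.loads(settings_text)
    deny = set(settings.get("permissions", {}).get("deny", []))
    return sorted(REQUIRED_DENY - deny)


# Abbreviated example settings, for illustration only.
complete = json.dumps({
    "permissions": {"deny": ["WebSearch", "WebFetch", "Bash(curl:*)", "Bash(wget:*)"]}
})
incomplete = json.dumps({"permissions": {"deny": ["WebSearch"]}})

print(missing_web_blocks(complete))
print(missing_web_blocks(incomplete))
```

In practice the string would be read from `.claude/settings.json` in the project directory rather than built inline.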

### C\.2Codex

**Codex Sandbox Configuration \(`config.toml`\)**

    # Codex sandboxed reproducibility profile
    # - Confines execution to the workspace (current directory + subdirs)
    # - Disables Codex web search
    # - Allows network only for package installation (pip / CRAN)

    sandbox_mode = "workspace-write"
    approval_policy = "untrusted"
    web_search = "disabled"

    [sandbox_workspace_write]
    network_access = true
    exclude_slash_tmp = true
    exclude_tmpdir_env_var = true

## Appendix DExtended Results

### D\.1Execution Failures in Codex Reproduction Runs

Table S1: Consolidated Failure Categories in Replication Materials

## Appendix EList of the 54 Papers

Table S2: Overview of the 54 papers included in the study\.

| No\. | Title | Authors | Date |
| --- | --- | --- | --- |
| 1 | Measuring Distances in High Dimensional Spaces: Why Average Group Vector Comparisons Exhibit Bias, And What to Do about It | Breanna Green, William Hobbs, Sofia Avila, Pedro L\. Rodriguez, Arthur Spirling, Brandon M\. Stewart | January 2025 |
| 2 | Scaling up fact\-checking using the wisdom of crowds | Jennifer Allen, Antonio A\. Arechar, Gordon Pennycook, David G\. Rand | September 2021 |
| 3 | Quantifying the impact of misinformation and vaccine\-skeptical content on Facebook | Jennifer Allen, Duncan J\. Watts, David G\. Rand | May 2024 |
| 4 | Understanding and combatting misinformation across 16 countries on six continents | Antonio A\. Arechar, Jennifer Allen, Adam J\. Berinsky, Rocky Cole, Ziv Epstein, Kiran Garimella, Andrew Gully, Jackson G\. Lu, Robert M\. Ross, Michael N\. Stagnaro, Yunhao Zhang, Gordon Pennycook, David G\. Rand | June 2023 |
| 5 | Leveraging AI for democratic discourse: Chat interventions can improve online political conversations at scale | Lisa P\. Argyle, Christopher A\. Bail, Ethan C\. Busby, Joshua R\. Gubler, Thomas Howe, Christopher Rytting, Taylor Sorensen, David Wingate | October 2023 |
| 6 | Measuring and Explaining Political Sophistication through Textual Complexity | Kenneth Benoit, Kevin Munger, Arthur Spirling | March 2019 |
| 7 | Timing matters when correcting fake news | Nadia M\. Brashier, Gordon Pennycook, Adam J\. Berinsky, David G\. Rand | January 2021 |
| 8 | Labeling social media posts: does showing coders multimodal content produce better human annotation, and a better machine classifier? | Haohan Chen, James Bisbee, Joshua A\. Tucker, Jonathan Nagler | July 2025 |
| 9 | Reducing political polarization in the United States with a mobile chat platform | Aidan Combs, Graham Tierney, Brian Guay, Friedolin Merhout, Christopher A\. Bail, D\. Sunshine Hillygus, Alexander Volfovsky | August 2023 |
| 10 | Perceived gender and political persuasion: a social media field experiment during the 2020 US Democratic presidential primary election | Aidan Combs, Graham Tierney, Fatima Alqabandi, Devin Cornell, Gabriel Varela, Andrés Castro Araújo, Lisa P\. Argyle, Christopher A\. Bail, Alexander Volfovsky | August 2023 |
| 11 | Fact\-checking information from large language models can decrease headline discernment | Matthew R\. DeVerna, Harry Yaojun Yan, Kai\-Cheng Yang, Filippo Menczer | December 2024 |
| 12 | Partisan disparities in the funding of science in the United States | Alexander C\. Furnas, Nic Fishman, Leah Rosenstiel, Dashun Wang | September 2025 |
| 13 | Partisan disparities in the use of science in policy | Alexander C\. Furnas, Timothy M\. LaPira, Dashun Wang | April 2025 |
| 14 | Quantifying the use and potential benefits of artificial intelligence in scientific research | Jian Gao, Dashun Wang | October 2024 |
| 15 | Current engagement with unreliable sites from web search driven by navigational search | Kevin T\. Greene, Nilima Pisharody, Lucas Augusto Meyer, Mayana Pereira, Rahul Dodhia, Juan Lavista Ferres, Jacob N\. Shapiro | October 2024 |
| 16 | Supersharers of fake news on Twitter | Sahar Baribi\-Bartov, Briony Swire\-Thompson, Nir Grinberg | May 2024 |
| 17 | Don’t get it or don’t spread it: comparing self\-interested versus prosocial motivations for COVID\-19 prevention behaviors | Jillian J\. Jordan, Erez Yoeli, David G\. Rand | October 2021 |
| 18 | Short\-term exposure to filter\-bubble recommendation systems has limited polarization effects: Naturalistic experiments on YouTube | Naijia Liu, Xinlan Emily Hu, Yasemin Savas, Matthew A\. Baum, Adam J\. Berinsky, Allison J\. B\. Chaney, Christopher Lucas, Rei Mariman, Justin de Benedictis\-Kessner, Andrew M\. Guess, Dean Knox, Brandon M\. Stewart | February 2025 |
| 19 | Divergent patterns of engagement with partisan and low\-quality news across seven social media platforms | Mohsen Mosleh, Jennifer Allen, David G\. Rand | October 2025 |
| 20 | Shared partisanship dramatically increases social tie formation in a Twitter field experiment | Mohsen Mosleh, Cameron Martel, Dean Eckles, David G\. Rand | February 2021 |
| 21 | Citizen preferences for online hate speech regulation | Simon Munzert, Richard Traunmüller, Pablo Barberá, Andrew Guess, JungHwan Yang | February 2025 |
| 22 | Who’s cheating on your survey? A detection approach with digital trace data | Simon Munzert, Sebastian Ramirez\-Ruiz, Pablo Barberá, Andrew M\. Guess, JungHwan Yang | April 2024 |
| 23 | Fighting bias with bias: How same\-race endorsements reduce racial discrimination on Airbnb | Minsu Park, Chao Yu, Michael Macy | February 2023 |
| 24 | Accuracy prompts are a replicable and generalizable approach for reducing the spread of misinformation | Gordon Pennycook, David G\. Rand | April 2022 |
| 25 | Fighting misinformation on social media using crowdsourced judgments of news source quality | Gordon Pennycook, David G\. Rand | January 2019 |
| 26 | Shifting attention to accuracy can reduce misinformation online | Gordon Pennycook, Ziv Epstein, Mohsen Mosleh, Antonio A\. Arechar, Dean Eckles, David G\. Rand | March 2021 |
| 27 | Elite party cues increase vaccination intentions among Republicans | Sophia L\. Pink, James Chu, James N\. Druckman, David G\. Rand, Robb Willer | August 2021 |
| 28 | Emergence and collapse of reciprocity in semiautomatic driving coordination experiments with humans and machines | Hirokazu Shirado, Gari A\. Alabede, Nicholas A\. Christakis | December 2023 |
| 29 | Protest movements involving limited violence can sometimes be effective: Evidence from the 2020 BlackLivesMatter protests | Eric Shuman, Saghi Ghassim, Siwar Hasan\-Aslih, Eran Halperin | March 2022 |
| 30 | Can Exposure to Celebrities Reduce Prejudice? The Effect of Mohamed Salah on Islamophobic Behaviors and Attitudes | Ala’ Alrababa’h, William Marble, Salma Mousa, Alexandra A\. Siegel | June 2021 |
| 31 | Simple autonomous agents can enhance creative semantic discovery by human groups | Aiko Ueshima, Hirofumi Takesue, Kunihiro Kimura, Tatsuya Kameda | June 2024 |
| 32 | Characterizing Population\-level Changes in Human Behavior during the COVID\-19 Pandemic in the United States | Urmi Parekh, Junming Huang, Brennan Klein, Sagar Kumar, Shengjia Zhang, Benjamin D\. Horne, Gourab Ghoshal, Johan Bollen | September 2025 |
| 33 | Social identity shapes antecedents and functional outcomes of moral emotion expression | William J\. Brady, Jay J\. Van Bavel | April 2025 |
| 34 | Sexism in Teams: Exposure to Sexist Comments Increases Emotional Synchrony but Eliminates Its Benefits | Christopher G\. Burns, Hila Riemer, Lu Liu, Arik Cheshin | April 2025 |
| 35 | Trust in scientists and their role in society across 68 countries | Viktoria Cologna, Niels G\. Mede, Livio Berger, Sarahanne M\. Field, Ala M\. Hamed, Arko Olesk, Michael Pareschi, Basil Schmid, Niels Mede, Mike S\. Schäfer | January 2025 |
| 36 | Human social preferences cluster and spread in the field | Alexander Ehlert, Robert Böhm, Özgür Gürerk, Hannes Rusch | September 2020 |
| 37 | The Impact of Marriage Equality Campaigns on Stress: Did a Swiss Public Vote Get Under the Skin? | Léïla Eisner, Tabea Hässler, Sabine Oreiller, Emilie Mainaud, Élodie Lopes, Davide Morselli | July 2024 |
| 38 | Valence Biases and Emergence in the Stereotype Content of Intersecting Social Categories | Susan T\. Fiske, Federica Pasin, Carina Moreira Farias, Theresa Gasser | April 2023 |
| 39 | A Summer Bridge Program for First\-Generation Low\-Income Students Stretches Academic Ambitions With Lasting Effects on GPA | Hazel Rose Markus, MarYam G\. Hamedani, Alyssa S\. Fu, Sarah S\. M\. Townsend, Dorainne J\. Green, Daron S\. Williams, Robert S\. Montoya, Mesmin Destin, Nicole M\. Stephens, Thomas S\. Dee, Ned Johnson | December 2024 |
| 40 | Emotion regulation contagion drives reduction in negative intergroup emotions | Omer Pinus, Yajun Cao, Eran Halperin, Alin Coman, James J\. Gross, Amit Goldenberg | February 2025 |
| 41 | The Effect of Prediction Error on Belief Update Across the Political Spectrum | Madalina Vlasceanu, Michael J\. Morais, Alin Coman | June 2021 |
| 42 | Affective Prediction Errors in Persistence and Escalation of Aggression | Marius C\. Vollberg, Mina Cikara | May 2024 |
| 43 | Empathy\-Based Counterspeech Can Reduce Racist Hate Speech in a Social Media Field Experiment | Dominik Hangartner, Gloria Gennaro, Sary Alasiri, Nicholas Bahrich, Alexandra Bornhoft, Joseph Bouber, Buket Buse Demirci, Lainey Doenber, Renee Dyber, Sakina Hansen, Marlene Hessberger, Samuel Höhne, Aya Kachi, Amalia Kämpfer, Nils Krumm, Blazenka Kucera, Julia Linek, Leila Mack, Madeline Mahler, Dilan Marc, Ahmet Mehmedovic, Céline Odermatt, Moritz Pail, Franziska Perle, Mara Petermichl, Daria Petrovic, Amira Preininger, Anna Rau, Mirjam Rauscher, Lea Reker, Mia Ristic, Sarah Schnyder, Selina Schröter, Dylan Scott, Yeliz Seren, Franziska Spielberger, Peter Swillus, Victoria da Torre, Anouk Tso, Yana Volkova, Yiran Wang, Hannah Widmaier, Jenny\-Marie Winkler, Salome Wolf, Yin Yao | December 2021 |
| 44 | Moral Universalism and the Structure of Ideology | Kirill Solovev, Nicolas Pröllochs | January 2023 |
| 45 | Disentangling participation in online political discussions with a collective field experiment | Andrew Oswald, Carl Sherwood, Jon Woon | December 2025 |
| 46 | Partisan conflict over content moderation is more than disagreement about facts | Ruth E\. Appel, Jennifer Pan, Margaret E\. Roberts | November 2023 |
| 47 | Reranking partisan animosity in algorithmic social media feeds alters affective polarization | Tiziano Piccardi, Martin Saveski, Chenyan Jia, Jeffrey Hancock, Jeanne L\. Tsai, Michael S\. Bernstein | November 2025 |
| 48 | Prebunking and credible source corrections increase election credibility: Evidence from the US and Brazil | John M\. Carey, Brian Fogarty, Marília Gehrke, Brendan Nyhan, Jason Reifler | August 2025 |
| 49 | The small effects of political advertising are small regardless of context, message, sender, or receiver: Evidence from 59 real\-time randomized experiments | Alexander Coppock, Seth J\. Hill, Lynn Vavreck | September 2020 |
| 50 | Listen for a change? A longitudinal field experiment on listening’s potential to enhance persuasion | Erik Santoro, David E\. Broockman, Joshua L\. Kalla, Roni Porat | February 2025 |
| 51 | Information\-sharing and cooperation in networked collective action groups | Ashley Harrell, Tom Wolff | December 2023 |
| 52 | Model uncertainty, political contestation, and public trust in science: Evidence from the COVID\-19 pandemic | S\. E\. Kreps, D\. L\. Kriner | September 2020 |
| 53 | Selective and deceptive citation in the construction of dueling consensuses | Andrew Beers, Sarah Nguyẽn, Kate Starbird, Jevin D\. West, Emma S\. Spiro | September 2023 |
| 54 | Public Communication about Science in 68 Countries: Global Evidence on How People Encounter and Engage with Information about Science | Niels G\. Mede, Viktoria Cologna, Sebastian Berger, John C\. Besley, Cameron Brick, Marina Joubert, Edward W\. Maibach, Sabina Mihelj, Naomi Oreskes, Mike S\. Schäfer, Sander van der Linden | October 2025 |

## Appendix F Extended Results

(a) Claude Code (Opus 4.6) ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/accuracy_fail_comparison_cc_pdf.png)
(b) Codex (GPT-5.3) ![Refer to caption](https://arxiv.org/html/2606.11447v1/Figures/accuracy_fail_comparison_cx_pdf.png)

Figure S1: Effect of providing paper PDFs on computational reproducibility accuracy across two LLM-based agents. a, Claude Code and b, Codex were evaluated under two conditions: code and data only ('Anonymized') versus code, data, and the associated paper PDF ('Anonymized + PDF'). Left sub-panels show accuracy for all tasks (N = 221), non-reproducible tasks (N = 10), and all papers (N = 54); right sub-panels show failure rates. Values are arithmetic means across three independent runs. Providing PDFs yielded modest gains in overall task accuracy for both Claude Code (93.4% to 94.5%) and Codex (62.1% to 65.4%), with corresponding improvements at the paper level. However, accuracy on non-reproducible tasks (those whose gold-standard answer indicates missing or inaccessible data) declined substantially for both agents (Claude Code: 100.0% to 63.3%; Codex: 100.0% to 90.0%), suggesting that access to the paper's reported results biases models toward extracting expected outputs rather than correctly identifying execution failures. PDF access markedly reduced Codex's failure rate at both the task level (17.8% to 12.2%) and the paper level (27.0% to 5.6%), indicating that supplementary context helps the weaker model overcome execution barriers, while Claude Code maintained zero failures in both conditions.
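The arithmetic behind the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the `REPORTED` dictionary, its keys, and the helpers `mean_accuracy` and `pp_change` are hypothetical names, and the only numbers used are the accuracy values quoted in the caption.

```python
# Accuracy values (percent) copied from the Figure S1 caption:
# (Anonymized, Anonymized + PDF). The dictionary layout is an
# illustrative assumption, not the benchmark's data format.
REPORTED = {
    ("Claude Code", "all tasks"): (93.4, 94.5),
    ("Codex", "all tasks"): (62.1, 65.4),
    ("Claude Code", "non-reproducible tasks"): (100.0, 63.3),
    ("Codex", "non-reproducible tasks"): (100.0, 90.0),
}


def mean_accuracy(run_accuracies):
    """Arithmetic mean of per-run accuracies (the caption reports means over three runs)."""
    return sum(run_accuracies) / len(run_accuracies)


def pp_change(before, after):
    """Change in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


# Percentage-point change from adding the paper PDF, per agent and task subset.
changes = {key: pp_change(*values) for key, values in REPORTED.items()}

for (agent, subset), delta in changes.items():
    print(f"{agent:12s} {subset:24s} {delta:+.1f} pp")
```

Read this way, the overall gains are +1.1 percentage points for Claude Code and +3.3 for Codex, while accuracy on non-reproducible tasks drops by 36.7 and 10.0 points respectively.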
