Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
Summary
This paper benchmarks agentic AI systems on the task of loading, understanding, and reformatting fragmented neuroscience data, finding that while agents perform well on subtasks, they rarely achieve fully error-free end-to-end solutions and human oversight remains necessary.
# Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

Source: [https://arxiv.org/abs/2605.12808](https://arxiv.org/abs/2605.12808) [View PDF](https://arxiv.org/pdf/2605.12808)

> Abstract: Neuroscience data are highly fragmented across labs, formats, and experimental paradigms, and reuse often requires substantial manual effort. A persistent roadblock to data reuse and integration is the need to decipher bespoke and diverse data formatting choices. Common data formats have been proposed in response, but the field continues to struggle with a fundamental tension: formats flexible enough to accommodate diverse experiments are rarely descriptive enough to be self-explanatory, and sufficiently descriptive formats demand detailed documentation and curation effort that few labs can sustain. Agentic AI is a natural candidate to solve this problem: LLMs read code and text faster and with sustained attention to the low-level details humans tend to skim over. To measure how well agentic AI performs on this task, we selected eight recent papers studying large-scale mouse neural population recordings that shared both data and code, spanning diverse recording modalities, behavioral paradigms, and dataset formats (e.g., NWB, specialized APIs, and general-purpose Python or MATLAB files). We provided agents with the data, code, and paper, and prompted them to load, understand, and reformat the data for a common downstream task: training a decoder from neural activity to task or behavioral variables. General-purpose coding agents commonly used by scientists performed well on each sub-task, but rarely strung together a fully error-free end-to-end solution. We characterize the types of mistakes agents made and the dataset properties that elicited them, and propose data-sharing best practices for the agentic-AI era. We further find that agents-as-judges are unreliable at catching errors, especially without ground-truth references, so interactive, human-in-the-loop coding remains necessary.

## Submission history

From: Kristin Branson

**[v1]** Tue, 12 May 2026 23:00:18 UTC (11,544 KB)
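The abstract's downstream task, decoding task or behavioral variables from neural activity, only works if the reformatted neural data and targets stay aligned trial by trial and time bin by time bin. The sketch below is an illustration of that kind of alignment check, not the paper's benchmark code or output format. The nested-list layout (trials, then time bins, then neurons) and the function name `check_decoder_inputs` are assumptions made for this example.

```python
from typing import List, Sequence


def check_decoder_inputs(
    neural: Sequence[Sequence[Sequence[float]]],
    targets: Sequence[Sequence[float]],
) -> List[str]:
    """Return a list of alignment problems; an empty list means none were found.

    Hypothetical layout (an assumption, not taken from the paper):
      neural[trial][time_bin][neuron] -> activity value
      targets[trial][time_bin]        -> task or behavioral variable
    """
    problems: List[str] = []

    # Every trial of neural activity needs a matching trial of targets.
    if len(neural) != len(targets):
        problems.append(
            f"trial count mismatch: {len(neural)} neural vs {len(targets)} target trials"
        )
        return problems

    n_neurons = None
    for i, (trial_x, trial_y) in enumerate(zip(neural, targets)):
        # Within a trial, neural bins and target bins must line up one to one.
        if len(trial_x) != len(trial_y):
            problems.append(
                f"trial {i}: {len(trial_x)} neural bins vs {len(trial_y)} target bins"
            )
        # The neuron count must be the same in every time bin of every trial.
        for row in trial_x:
            if n_neurons is None:
                n_neurons = len(row)
            elif len(row) != n_neurons:
                problems.append(f"trial {i}: inconsistent neuron count")
                break

    return problems


# Usage: two trials with two neurons each; the second trial has a single time bin.
neural = [[[0.0, 1.0], [2.0, 0.0]], [[1.0, 1.0]]]
targets = [[0, 1], [1]]
print(check_decoder_inputs(neural, targets))
```

A check like this only catches structural misalignment. It cannot tell whether the right variable was extracted or whether timestamps were interpreted correctly, which is where the abstract says human review is still needed.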
Similar Articles
A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline
This paper presents an empirical study evaluating general-purpose coding agents on a fly optogenetics data-to-discovery pipeline, finding that while agents can automate individual stages, they struggle with end-to-end tasks requiring scientific judgment and resource management.
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
Introduces SciAgentArena, a benchmark of ~200 tasks for evaluating AI agents in real scientific research. Finds that agents are effective for well-specified data-analysis workflows but struggle with novel insights and open-ended exploration.
Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare
This paper presents a structured framework for benchmarking generative, multimodal, and agentic AI in healthcare, addressing the gap between high benchmark scores and real-world clinical reliability, safety, and relevance.
Most “agentic AI” conversations feel too abstract. Here is how my agentic research system looks like
The author shares a practical breakdown of an agentic research system they built to identify and evaluate AI use cases within companies. The system uses six agents for discovery, evaluation, and context extraction, emphasizing human-in-the-loop decision-making over full autonomy.
AI Coding Agents Can Reproduce Social Science Findings
This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.