@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderboard for submissions soon.
Summary
ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.
Source: https://github.com/facebookresearch/ProgramBench

ProgramBench
Can Language Models Rebuild Programs From Scratch?
Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior.
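At its core, the check is behavioral: the rebuilt program must reproduce the reference binary's observable outputs. A minimal Python sketch of that idea follows; the binary paths, arguments, and single stdout/exit-code comparison are hypothetical simplifications, and ProgramBench's actual harness and scoring are more involved.
import subprocess

def same_behavior(reference_bin, rebuilt_bin, args, stdin=b""):
    """Return True if both binaries produce identical stdout and exit codes."""
    ref = subprocess.run([reference_bin, *args], input=stdin, capture_output=True, timeout=60)
    new = subprocess.run([rebuilt_bin, *args], input=stdin, capture_output=True, timeout=60)
    return ref.stdout == new.stdout and ref.returncode == new.returncode

# Hypothetical usage with a ripgrep-like target:
# same_behavior("./reference/rg", "./rebuilt/rg", ["-n", "pattern", "file.txt"])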
Quickstart
We recommend uv for managing Python environments.
# Run without installing
uvx programbench --help
# Or install into a project
uv pip install programbench
# Or with pip
pip install programbench
For development:
git clone https://github.com/facebookresearch/programbench.git
cd programbench
uv sync # installs editable + dev dependencies
For more details, please refer to the Usage Guide.
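As a quick sanity check after installing, one might confirm the distribution is visible to Python. This sketch assumes only the pip package name shown in the Quickstart above; the actual CLI subcommands are covered in the Usage Guide.
import importlib.metadata

try:
    # "programbench" is the pip distribution name from the Quickstart above.
    print("programbench", importlib.metadata.version("programbench"))
except importlib.metadata.PackageNotFoundError:
    print("programbench is not installed; see the Quickstart commands above.")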
Citation
If you found our work useful, please cite it:
@misc{yang2026programbenchlanguagemodelsrebuild,
  title={ProgramBench: Can Language Models Rebuild Programs From Scratch?},
  author={John Yang and Kilian Lieret and Jeffrey Ma and Parth Thakkar and Dmitrii Pedchenko and Sten Sootla and Emily McMilin and Pengcheng Yin and Rui Hou and Gabriel Synnaeve and Diyi Yang and Ofir Press},
  year={2026},
  eprint={2605.03546},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2605.03546},
}
License
ProgramBench is licensed under the terms of the license found in LICENSE.
Similar Articles
ProgramBench (5 minute read)
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
META Superintelligence Lab Presents: ProgramBench: Can SOTA AI Recreate Real Executable Programs (ffmpeg, SQLite, ripgrep) From Scratch Without The Internet?
Meta's Superintelligence Lab introduces ProgramBench, a benchmark evaluating whether state-of-the-art AI models can recreate real executable programs like ffmpeg and SQLite from scratch without internet access.
PaperBench: Evaluating AI’s Ability to Replicate AI Research
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research, covering 20 ICML 2024 papers with 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves only a 21% replication score, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
I built a benchmark for AI "memory" in coding agents. Looking for others to beat it.
A developer created continuity-benchmarks, a new benchmark that tests AI coding agents' ability to stay consistent with project rules during active development. It addresses gaps in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
OpenAI introduces MLE-bench, a benchmark of 75 Kaggle ML competitions to evaluate AI agents on real-world ML engineering tasks. The best setup, o1-preview with AIDE scaffolding, achieves at least a Kaggle bronze medal in 16.9% of competitions.