Collider-Bench is a new benchmark that evaluates LLM agents on reproducing particle physics analyses from the Large Hadron Collider using only public papers and open software, a task that requires physical reasoning to fill in missing implementation details.
DeepCode is a fully autonomous framework for document-to-codebase synthesis that uses principled information-flow management to convert scientific papers into production-grade code, achieving state-of-the-art results on PaperBench and surpassing PhD-level human experts.