Surface Evolver Bench: my benchmark asking LLMs to write complex physical simulations in a custom data format
Summary
Introduces Surface Evolver Bench, a benchmark that evaluates LLMs on writing complex physical simulations in a custom data format.
Similar Articles
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones, creating challenging benchmarks that maintain validity and diversity while enabling model self-improvement and enhanced training performance.
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
PRL-Bench is a comprehensive benchmark for evaluating LLMs' capabilities in frontier physics research, constructed from 100 curated Physical Review Letters papers across five physics subfields. The benchmark reveals significant gaps in current LLM performance (best scores below 50%), designed to test end-to-end research workflows, complex reasoning, and autonomous exploration.
BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
BilliardPhys-Bench is a new benchmark that tests multimodal LLMs on physical reasoning using synthetic billiards scenarios, requiring predictions of collisions and final ball positions. The paper finds that current models struggle with longer simulations and exhibit a 'stasis bias' of predicting no interaction when uncertain.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.