Surface Evolver Bench: my benchmark asking LLMs to write complex physical simulations in a custom data format

Reddit r/LocalLLaMA 07/03/26, 12:07 PM Tools

benchmark llm physical-simulation custom-data-format code-generation evaluation

Summary

Introduces Surface Evolver Bench, a benchmark that evaluates LLMs on writing complex physical simulations in a custom data format.

Similar Articles

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

arXiv cs.AI

This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Hugging Face Daily Papers

BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones, creating challenging benchmarks that maintain validity and diversity while enabling model self-improvement and enhanced training performance.

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

Hugging Face Daily Papers

PRL-Bench is a comprehensive benchmark for evaluating LLMs' capabilities in frontier physics research, constructed from 100 curated Physical Review Letters papers across five physics subfields. It is designed to test end-to-end research workflows, complex reasoning, and autonomous exploration, and it reveals significant gaps in current LLM performance (best scores below 50%).

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

arXiv cs.AI

BilliardPhys-Bench is a new benchmark that tests multimodal LLMs on physical reasoning using synthetic billiards scenarios, requiring predictions of collisions and final ball positions. The paper finds that current models struggle with longer simulations and exhibit a 'stasis bias' of predicting no interaction when uncertain.

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

arXiv cs.AI

Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.
