Surface Evolver Bench: my benchmark asking LLMs to write complex physical simulations in a custom data format

Reddit r/LocalLLaMA Tools

Summary

Introduces Surface Evolver Bench, a benchmark that evaluates LLMs on writing complex physical simulations in a custom data format.

Similar Articles

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Hugging Face Daily Papers

BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones. The resulting benchmarks stay valid and diverse, and they support model self-improvement and better training performance.

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

Hugging Face Daily Papers

PRL-Bench is a comprehensive benchmark for evaluating LLMs' capabilities in frontier physics research, constructed from 100 curated Physical Review Letters papers across five physics subfields. Designed to test end-to-end research workflows, complex reasoning, and autonomous exploration, it reveals significant gaps in current LLM performance, with the best scores below 50%.