Tag
Researchers from the Specula team created SysMoBench, a benchmark evaluating whether LLMs can faithfully model real-world computing systems in TLA+ or merely recite textbook specifications. The benchmark tests 11 systems across four phases and reveals systematic gaps in current LLMs' ability to accurately model system implementations versus reference papers.
TLA+ semantics guarantee nonordered updates, but the TLC model checker breaks these guarantees by requiring ordered assignments and adding effectful operators like PrintT, causing confusion for beginners.
Hillel Wayne discusses how LLMs, while popular for writing formal specifications like TLA+ and Alloy, often produce shallow, tautological properties that fail to capture subtle bugs, based on analysis of community projects.