Tag
This paper evaluates Claude Code within an agentic proving framework on the Clever benchmark for program verification. It achieves over 98% success in specification generation and end-to-end verification, a result suggesting that existing benchmarks may be insufficient for evaluating modern agentic provers.