reasoning-evaluation

#reasoning-evaluation

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

arXiv cs.AI · 2026-06-02

This paper introduces a multi-turn interactive framework for reasoning evaluation in which LLMs must query a hidden environment and integrate partial observations. The framework is instantiated as a benchmark of 474 executable games across five difficulty levels; the benchmark discriminates between models and exposes differences in how they reason.

#reasoning-evaluation

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Hugging Face Daily Papers · 2026-05-31

This paper investigates the production-evaluation gap in large reasoning models (LRMs), finding that they produce near-perfect solutions yet fail to robustly evaluate reasoning, a failure the paper attributes to an answer confirmation bias.
