executable-games


Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

arXiv cs.AI · 2d ago

This paper introduces a multi-turn interactive framework for evaluating reasoning, in which an LLM must query a hidden environment and integrate the partial observations it receives. The framework is instantiated as a benchmark of 474 executable games across five difficulty levels, which shows discriminative power and exposes differences in how models reason.
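
To make the query-and-observe loop concrete, here is a minimal sketch of what one executable game could look like. It is not taken from the paper: the `HiddenNumberGame` class, the `play` function, and the higher/lower feedback rule are all hypothetical, and a simple range-narrowing strategy stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HiddenNumberGame:
    """Hypothetical executable game: the secret is never shown to the agent."""
    secret: int
    low: int = 1
    high: int = 100

    def query(self, guess: int) -> str:
        # Each query reveals only a partial observation about the hidden state.
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "correct"


def play(game: HiddenNumberGame, max_turns: int = 10) -> Tuple[bool, int]:
    """Stand-in for a model: narrows the candidate range after each observation."""
    low, high = game.low, game.high
    for turn in range(1, max_turns + 1):
        guess = (low + high) // 2
        observation = game.query(guess)
        if observation == "correct":
            return True, turn
        # Integrate the partial observation into the agent's belief state.
        if observation == "higher":
            low = guess + 1
        else:
            high = guess - 1
    return False, max_turns


# One episode: returns whether the game was solved and how many turns it took.
solved, turns = play(HiddenNumberGame(secret=37))
```

In a sketch like this, difficulty could be varied by widening the range or shrinking `max_turns`; how the paper actually defines its five levels is not stated in this summary.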
