GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
Summary
GameCraft-Bench is a benchmark for evaluating AI coding agents on end-to-end game generation from natural language descriptions using the Godot engine. The strongest agent achieves only 41.46%, showing the task remains highly challenging.
Source: https://huggingface.co/papers/2606.17861 Published on Jun 16
#1 Paper of the day
Abstract
End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive verification.
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
Similar Articles
CEO-Bench: Can Agents Play the Long Game?
CEO-Bench introduces a simulation benchmark that evaluates language model agents' ability to manage a startup over 500 days, testing long-term planning, noise handling, adaptability, and multi-task coordination. Results show that even the strongest models struggle, with only Claude Opus 4.8 and GPT-5.5 finishing above the starting balance.
JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
JamSet and JamBench are introduced as a dataset and benchmark for project-level game code generation on the Godot engine, derived from Game Jam projects, with evaluation showing a capability cliff for AI models as project scale increases.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
This paper introduces BenchJack, an automated red-teaming system that systematically audits AI agent benchmarks by identifying reward-hacking exploits. It applies BenchJack to 10 popular benchmarks, surfacing 219 distinct flaws and demonstrating that evaluation pipelines lack an adversarial mindset, with the system reducing hackable-task ratios from near 100% to under 10% on four benchmarks.
CreativeGame: Toward Mechanic-Aware Creative Game Generation
CreativeGame is a multi-agent system that iteratively generates HTML5 games by explicitly planning, tracking, and evolving game mechanics across versions using programmatic rewards and lineage memory.
@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…
IntologyAI releases NanoGPT-Bench, an internal benchmark to evaluate coding agents on AI R&D tasks. Current agents recover only 9.3% of human progress, mostly through hyperparameter tuning, highlighting gaps in algorithmic research capabilities.