GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Hugging Face Daily Papers 06/16/26, 12:00 AM Papers

game-generation coding-agents benchmark godot ai-agents evaluation

Summary

GameCraft-Bench is a benchmark for evaluating AI coding agents on end-to-end game generation from natural language descriptions using the Godot engine. The strongest agent achieves only 41.46%, showing the task remains highly challenging.

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

Original Article

Cached at: 06/17/26, 03:35 AM

Source: https://huggingface.co/papers/2606.17861 Published on Jun 16

#1 Paper of the day

Abstract

End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive verification.

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
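The abstract does not say how rubric judgments are combined into a single percentage such as 41.46%. As a purely illustrative sketch, one plausible scheme is a weighted mean of per-criterion judge scores for each task, averaged over tasks. Everything in the block below is an assumption made for illustration: the names RubricItem, task_score, and benchmark_score, the criterion labels, the weights, and the [0, 1] score range are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion as scored by a judge (hypothetical structure)."""
    name: str      # illustrative label, e.g. "mechanics" or "visual feedback"
    weight: float  # relative importance, assumed positive
    score: float   # judge-assigned score, assumed to lie in [0, 1]


def task_score(items: Sequence[RubricItem]) -> float:
    """Weighted mean of rubric scores for one task, as a percentage."""
    if not items:
        raise ValueError("a task needs at least one rubric item")
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(item.weight * item.score for item in items)
    return 100.0 * weighted / total_weight


def benchmark_score(task_scores: Sequence[float]) -> float:
    """Unweighted mean of per-task percentages (an assumed aggregation)."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores) / len(task_scores)


if __name__ == "__main__":
    # Made-up numbers: a game whose mechanics work but whose content is missing.
    rubric = [
        RubricItem("mechanics", weight=3.0, score=1.0),
        RubricItem("content", weight=1.0, score=0.0),
    ]
    print(task_score(rubric))
    print(benchmark_score([task_score(rubric), 25.0]))
```

Under a scheme like this, a game with recognizable mechanics but thin content or weak presentation would earn partial credit rather than a pass or fail, which is consistent with the failure pattern the abstract describes. The benchmark's actual scoring may differ.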

Get this paper in your agent:

hf papers read 2606.17861

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

CEO-Bench: Can Agents Play the Long Game?

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

CreativeGame: Toward Mechanic-Aware Creative Game Generation

@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…
