GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Hugging Face Daily Papers

Summary

GameCraft-Bench is a benchmark for evaluating AI coding agents on end-to-end game generation from natural language descriptions using the Godot engine. The strongest agent achieves only 41.46%, showing the task remains highly challenging.

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

Cached at: 06/17/26, 03:35 AM


Source: https://huggingface.co/papers/2606.17861 Published on Jun 16

#1 Paper of the day

Abstract

End-to-end game generation remains difficult for coding agents, which must create complete playable games from natural language descriptions. The paper argues that evaluating this setting requires engine grounding, artifact completeness, and interactive verification.

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
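The abstract reports a single percentage per agent but does not say here how rubric verdicts are combined into it. As a rough illustration only, the sketch below assumes each task carries a weighted rubric, that a judge assigns each rubric item a score in [0, 1] after watching a replayed demonstration, and that the benchmark score is the unweighted mean over tasks. The names `RubricItem`, `task_score`, and `benchmark_score`, the weights, and the score scale are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricItem:
    """One hypothetical rubric criterion judged on a replayed demonstration."""
    name: str
    weight: float  # relative importance (assumed, not from the paper)
    score: float   # judge verdict in [0, 1] (assumed scale)


def task_score(items: list[RubricItem]) -> float:
    """Weighted mean of rubric scores for a single task, in [0, 1]."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(item.weight * item.score for item in items) / total_weight


def benchmark_score(tasks: dict[str, list[RubricItem]]) -> float:
    """Unweighted mean of per-task scores, reported as a percentage."""
    return 100.0 * mean(task_score(items) for items in tasks.values())


if __name__ == "__main__":
    # Two made-up tasks; task names and verdicts are illustrative only.
    demo = {
        "platformer_01": [
            RubricItem("player can jump", weight=2.0, score=1.0),
            RubricItem("score HUD updates", weight=1.0, score=0.0),
        ],
        "puzzle_07": [RubricItem("tiles swap on click", weight=1.0, score=0.5)],
    }
    print(f"{benchmark_score(demo):.2f}%")
```

Under these assumptions, a game with working mechanics but missing visual feedback still earns partial credit, which fits the abstract's observation that agents implement recognizable mechanics yet fall short on completeness.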


Get this paper in your agent:

hf papers read 2606.17861

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

CEO-Bench: Can Agents Play the Long Game?

Hugging Face Daily Papers

CEO-Bench introduces a simulation benchmark that evaluates language model agents' ability to manage a startup over 500 days, testing long-term planning, noise handling, adaptability, and multi-task coordination. Results show that even the strongest models struggle, with only Claude Opus 4.8 and GPT-5.5 finishing above the starting balance.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv cs.AI

This paper introduces BenchJack, an automated red-teaming system that systematically audits AI agent benchmarks by identifying reward-hacking exploits. It applies BenchJack to 10 popular benchmarks, surfacing 219 distinct flaws and demonstrating that evaluation pipelines lack an adversarial mindset, with the system reducing hackable-task ratios from near 100% to under 10% on four benchmarks.