An Exploratory Case Study of LLM-Assisted Refactoring and Gameplay Feature Generation in an Endless Runner Game

Hugging Face Daily Papers

Summary

This paper presents an exploratory case study evaluating GPT-4o's ability to perform refactoring and generate gameplay features in an endless runner game, finding that refactoring tasks succeeded while feature generation tasks mostly failed.

Large language models (LLMs) are increasingly used to support software development, but their practical usefulness in applied game-development settings remains underexplored, especially when generated code must be integrated into an existing game software system. This paper presents an exploratory empirical case study of GPT-4o in a custom Python/Pygame endless runner. The study examines six selected development tasks: three localized refactoring tasks and three tasks involving gameplay feature generation. The resulting implementations were evaluated using software metrics, unit tests, and manual gameplay assessments. In this case study, all three selected refactoring tasks were completed successfully in functional terms, whereas only one of the three selected gameplay feature generation tasks resulted in a correctly integrated feature. The findings suggest that, in this setting, GPT-4o handled localized transformations more reliably than tasks requiring new gameplay interactions across multiple existing systems. Given the exploratory single-case design, these results are best interpreted as indicative observations rather than as generalizable evidence of category-level model performance. Overall, the paper contributes a transparent case-based account of the opportunities and limitations of LLM-assisted refactoring and gameplay feature generation in an existing game software system.

Cached at: 06/23/26, 05:43 PM


Source: https://huggingface.co/papers/2606.21171

Abstract

Large language models demonstrate varying effectiveness in software development tasks, successfully completing localized refactoring but showing limitations in integrating new gameplay features within existing game systems.

Large language models (LLMs) are increasingly used to support software development, but their practical usefulness in applied game-development settings remains underexplored, especially when generated code must be integrated into an existing game software system. This paper presents an exploratory empirical case study of GPT-4o in a custom Python/Pygame endless runner. The study examines six selected development tasks: three localized refactoring tasks and three tasks involving gameplay feature generation. The resulting implementations were evaluated using software metrics, unit tests, and manual gameplay assessments. In this case study, all three selected refactoring tasks were completed successfully in functional terms, whereas only one of the three selected gameplay feature generation tasks resulted in a correctly integrated feature. The findings suggest that, in this setting, GPT-4o handled localized transformations more reliably than tasks requiring new gameplay interactions across multiple existing systems. Given the exploratory single-case design, these results are best interpreted as indicative observations rather than as generalizable evidence of category-level model performance. Overall, the paper contributes a transparent case-based account of the opportunities and limitations of LLM-assisted refactoring and gameplay feature generation in an existing game software system.
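The abstract names unit tests and software metrics as evaluation instruments but does not show them. The following is a minimal sketch of how a localized refactoring could be checked for functional equivalence and given a crude structural measure. Everything in it is an assumption for illustration: the scoring functions, their parameters, and the branch-count proxy are hypothetical and are not taken from the paper's codebase or its actual metric suite.

```python
import ast

# Hypothetical "before" version of a scoring routine with duplicated branches.
def score_before(distance, coins, multiplier_active):
    if multiplier_active:
        total = distance * 2 + coins * 10 * 2
    else:
        total = distance + coins * 10
    return total

# Hypothetical "after" version: same behavior, duplication factored out.
def score_after(distance, coins, multiplier_active):
    factor = 2 if multiplier_active else 1
    return factor * (distance + coins * 10)

def behaviour_preserved(cases):
    """Return True if both versions agree on every input tuple."""
    return all(score_before(*case) == score_after(*case) for case in cases)

def branch_count(source):
    """Crude complexity proxy (an assumption, not the paper's metric):
    count if/for/while statements and conditional expressions."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.IfExp)
    return sum(isinstance(node, branch_nodes) for node in ast.walk(tree))
```

A check of this kind only covers the "functional terms" side of the evaluation. Whether a newly generated feature integrates correctly with existing gameplay systems would still need the manual gameplay assessment the abstract mentions.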

