GameDevBench: Evaluating Agentic Capabilities Through Game Development
February 11, 2026
Authors: Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
cs.AI
Abstract
Despite rapid progress on coding agents, their multimodal counterparts have lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed: agents must navigate large, dense codebases while manipulating intrinsically multimodal assets, such as shaders, sprites, and animations, within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times as many lines of code and file changes as those in prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these mechanisms consistently improve performance, with the largest gain lifting Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
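As a rough illustration of what such visual feedback could look like in practice, here is a minimal Python sketch of an image-based feedback loop. Every name in it (capture_screenshot, run_agent_step, and so on) is a hypothetical placeholder, not GameDevBench's actual harness or API: after each round of agent edits, the harness renders the game scene and appends the screenshot to the agent's multimodal context so the next turn can see the visual result of its changes.

```python
# Hypothetical sketch of an image-based feedback loop; the function names
# and message format below are illustrative assumptions, not the paper's
# actual implementation.
import base64

def capture_screenshot() -> bytes:
    """Placeholder: render the running game scene to PNG bytes."""
    return b"\x89PNG..."  # stand-in for a real captured frame

def run_agent_step(context: list) -> str:
    """Placeholder: one agent turn that proposes a code patch."""
    return "diff --git a/Player.cs b/Player.cs ..."

def image_feedback_loop(task_prompt: str, rounds: int = 3) -> list:
    """After each agent edit, show the agent a screenshot of the scene."""
    context = [{"type": "text", "text": task_prompt}]
    for _ in range(rounds):
        patch = run_agent_step(context)          # agent proposes code edits
        context.append({"type": "text", "text": patch})
        frame = capture_screenshot()             # visual result of the edits
        context.append({                         # feed the frame back in
            "type": "image",
            "data": base64.b64encode(frame).decode("ascii"),
        })
        context.append({"type": "text",
                        "text": "Does the scene above match the task? "
                                "If not, revise your changes."})
    return context

if __name__ == "__main__":
    image_feedback_loop("Make the player sprite flash red when hit.")
```

A video-based variant could follow the same pattern, substituting a short clip of gameplay for the single frame so the agent can also judge motion and animation.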