FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

September 1, 2025
Authors: Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim
cs.AI

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.