FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Abstract

GUI agents powered by large language models (LLMs) show promise in interacting with diverse digital environments. Among these environments, video games offer a valuable testbed because of their varied interfaces, and adventure games pose additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and to tackle the observation-behavior gap: the challenge of remembering and acting on information from earlier in gameplay. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory to better plan and solve sequential tasks. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and the best-performing agents warrants continued research to narrow this divide.
