FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Abstract

GUI agents powered by large language models (LLMs) show promise in interacting with diverse digital environments. Among these environments, video games offer a valuable testbed because of their varied interfaces, and adventure games pose additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and to tackle the observation-behavior gap: the challenge of remembering and acting on information encountered earlier in gameplay. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory to better plan and solve sequential tasks. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy remains between humans and the best-performing agents, which warrants continued research to narrow this divide.

