FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
