AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

Abstract

For agents to learn continuously from interaction with the world at test time, they must be able to explore effectively, acquire new world knowledge and skills, retain relevant episodic experiences, and plan over long horizons. To evaluate these key abilities of test-time continual learning agents, we introduce AgentOdyssey, a novel evaluation framework that procedurally generates open-ended text games with rich entities, world dynamics, and long-horizon tasks. Critically, AgentOdyssey goes beyond the conventional machine learning assumption that learning does not occur at test time by placing agents in a continuous, long-horizon setting that interleaves learning and inference throughout deployment. We further propose a multifaceted evaluation methodology that not only measures game progress but also offers diagnostic tests on world knowledge acquisition, episodic memory, object and action exploration, action diversity, and model cost. We evaluate diverse agent paradigms in the generated games. Our experimental results reveal critical limits in agents' key abilities, as well as factors that influence their meaningful horizon. Although performance scales with stronger base models, even the top agent remains far below human performance, leaving substantial headroom for improvement. Among agent mechanisms, we find that short-term memory benefits multiple agent paradigms and is an important component of agent test-time training.