Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Abstract

Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability. Games are a good testbed of this kind. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. The framework trains the decision agent to improve its skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
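The interaction between a skill bank and skill discovery from unlabeled rollouts can be pictured with a toy sketch. This is not COSPLAY's implementation: the `Skill` fields, the word-overlap retrieval, and the bigram-counting `extract_skills` heuristic are all hypothetical stand-ins, chosen only to make the data flow concrete without an LLM.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable skill plus a minimal 'contract' (hypothetical fields)."""
    name: str
    precondition: str  # text describing when the skill applies
    effect: str        # text describing what the skill is expected to do


@dataclass
class SkillBank:
    """Toy skill bank: stores skills by name and retrieves by word overlap."""
    skills: dict[str, Skill] = field(default_factory=dict)

    def update(self, skill: Skill) -> None:
        # Adding a skill under an existing name overwrites (refines) it.
        self.skills[skill.name] = skill

    def retrieve(self, observation: str, k: int = 1) -> list[Skill]:
        # Assumption: a real system would use a learned retriever; here we
        # rank by shared words between the observation and the precondition.
        obs_words = set(observation.lower().split())
        scored = [
            (len(obs_words & set(s.precondition.lower().split())), s)
            for s in self.skills.values()
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in scored[:k]]


# A rollout is a list of (observation, action) pairs with no skill labels.
Rollout = list[tuple[str, str]]


def extract_skills(rollouts: list[Rollout], min_count: int = 2) -> list[Skill]:
    """Stand-in for skill discovery: promote action pairs that recur
    at least `min_count` times across rollouts into skills."""
    counts: Counter = Counter()
    first_obs: dict[tuple[str, str], str] = {}
    for rollout in rollouts:
        for (obs, a1), (_, a2) in zip(rollout, rollout[1:]):
            counts[(a1, a2)] += 1
            first_obs.setdefault((a1, a2), obs)
    return [
        Skill(
            name=f"{a1}+{a2}",
            precondition=first_obs[(a1, a2)],
            effect=f"do {a1} then {a2}",
        )
        for (a1, a2), n in counts.items()
        if n >= min_count
    ]
```

In a loop of this shape, the decision agent would call `retrieve` on each observation, act, and log the rollout; the skill pipeline would then call `extract_skills` on the logged rollouts and `update` the bank before the next round.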

