Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Abstract

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for long-horizon decision-making by embodied agents under partial observability. We first report our Gemini Plays Pokemon (GPP) experiments. Through iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Pokemon Yellow Legacy (hard mode), and Pokemon Crystal without losing a single battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, so emergent self-improvement signals surfaced alongside the human-in-the-loop refinement. Continual Harness removes the human from this loop entirely: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets, whereas Continual Harness adapts online within a single run. Evaluated on Pokemon Red and Pokemon Emerald with frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimal baseline and closes most of the gap to a hand-engineered expert harness. The size of the gain depends on model capability, and it is achieved despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself. In an online process-reward co-learning loop, a frontier teacher relabels an open-source agent's rollouts through the refining harness, and the relabeled rollouts are used to update the model. This drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
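
The alternation between acting and refining within a single, reset-free run can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the `Harness` fields, the `ToyCorridor` environment, and the rule-based `refine` step (which stands in for a foundation model rewriting its own prompt and skills) are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical harness state that the agent is allowed to rewrite."""
    prompt: str = "Reach the goal."
    skills: dict = field(default_factory=dict)
    memory: list = field(default_factory=list)  # full trajectory so far


class ToyCorridor:
    """Stand-in environment: start at cell 0, reach `goal`.
    'right' cannot enter cell `wall`; 'jump' moves two cells."""

    def __init__(self, goal=5, wall=3):
        self.pos, self.goal, self.wall = 0, goal, wall
        self.presses = 0  # button-press cost

    def done(self):
        return self.pos >= self.goal

    def step(self, button):
        self.presses += 1
        before = self.pos
        if button == "right" and self.pos + 1 != self.wall:
            self.pos += 1
        elif button == "jump":
            self.pos = min(self.pos + 2, self.goal)
        return {"pos": self.pos, "blocked": self.pos == before}


def act(harness, obs):
    """Pick a button using whatever skills the harness currently holds."""
    if obs["blocked"] and "when_blocked" in harness.skills:
        return harness.skills["when_blocked"]
    return "right"


def refine(harness):
    """Toy refinement rule: if the last three steps made no progress,
    add a skill and extend the prompt. A real system would query a model."""
    recent = harness.memory[-3:]
    if len(recent) == 3 and all(r["blocked"] for r in recent):
        harness.skills["when_blocked"] = "jump"
        harness.prompt += " If blocked, jump."


def run(env, harness, refine_every=4, max_presses=50):
    """Alternate acting and refining; the environment is never reset."""
    obs = {"pos": env.pos, "blocked": False}
    while not env.done() and env.presses < max_presses:
        button = act(harness, obs)
        obs = env.step(button)
        harness.memory.append({"button": button, **obs})
        if refine_every and env.presses % refine_every == 0:
            refine(harness)  # harness changes, env state carries on
    return env.presses


if __name__ == "__main__":
    print("with refinement:", run(ToyCorridor(), Harness()))
    print("without refinement:", run(ToyCorridor(), Harness(), refine_every=None))
```

In this toy setting the refining run gets stuck at the wall, learns the `when_blocked` skill from its own trajectory, and then finishes, while the run with refinement disabled exhausts its press budget. The point of the sketch is only the control flow: refinement reads past trajectory data and edits the harness mid-run, with no episode reset in between.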

