Continual Harness: Online Adaptation for Self-Improving Foundation Agents
May 11, 2026
Authors: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli
cs.AI
Abstract
Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for the long-horizon, partially observable decision-making faced by embodied agents. We first report our Gemini Plays Pokemon (GPP) experiments. Through iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Pokemon Yellow Legacy on hard mode, and Pokemon Crystal without losing a single battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside the human-in-the-loop refinement. Continual Harness removes the human from this loop entirely: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Whereas prompt-optimization methods require episode resets, Continual Harness adapts online within a single run. On Pokemon Red and Pokemon Emerald, across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to a minimalist baseline and closes most of the gap to a hand-engineered expert harness, with gains that grow with model capability, despite using the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. Finally, we close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the self-refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
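The act/refine alternation can be pictured as a simple control loop. The sketch below is a minimal, hypothetical rendering based only on the abstract's description, not the paper's implementation: every name here (HarnessState, env.observe, env.step, model.act, model.refine) is an assumed stand-in.

```python
# A minimal sketch of Continual Harness's act/refine alternation, assuming
# hypothetical interfaces (env.observe, env.step, model.act, model.refine);
# the paper's actual harness is not specified in this abstract.

from dataclasses import dataclass, field

@dataclass
class HarnessState:
    """Everything the agent is allowed to rewrite about itself."""
    prompt: str = "You control the game. Choose the next button press."
    sub_agents: dict = field(default_factory=dict)  # name -> sub-agent spec
    skills: dict = field(default_factory=dict)      # name -> reusable routine
    memory: list = field(default_factory=list)      # long-context notes

def continual_harness(env, model, act_steps=200, total_steps=100_000):
    """Reset-free loop: act for act_steps steps, then let the model refine
    its own harness from the trajectory so far; repeat in a single run."""
    state = HarnessState()
    trajectory = []
    obs = env.observe()  # minimal raw interface: screen plus button presses
    for step in range(total_steps):
        action = model.act(state, obs)  # pick the next button press
        obs = env.step(action)          # no episode reset, ever
        trajectory.append((action, obs))
        if (step + 1) % act_steps == 0:
            # Refinement phase: the model edits its own prompt, sub-agents,
            # skills, and memory, drawing on any past trajectory data.
            state = model.refine(state, trajectory)
    return state, trajectory
```

The key design point the abstract emphasizes is that both phases run within one continuous environment rollout, which is what distinguishes this loop from prompt-optimization methods that require episode resets.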
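The closing process-reward co-learning loop can be sketched in the same spirit. Again, all interfaces (harness.collect, teacher.score, student.update) are hypothetical stand-ins, assuming a student policy, a frontier teacher that scores intermediate steps, and an environment that persists across training iterations.

```python
# A minimal sketch of the online process-reward co-learning loop described
# above. All interfaces (harness.collect, teacher.score, student.update)
# are hypothetical stand-ins, not the paper's API.

def co_learning_loop(env, student, teacher, harness,
                     iterations=100, rollout_len=512):
    """Student rollouts through the self-refining harness are relabeled
    step-by-step by a frontier teacher, then used to update the student.
    The environment carries over between iterations (no resets)."""
    for _ in range(iterations):
        # 1) The open-source student acts through the harness; game state
        #    continues from the previous iteration.
        rollout = harness.collect(env, student, steps=rollout_len)
        # 2) The frontier teacher assigns a process reward to each step,
        #    scoring intermediate progress rather than only final outcomes.
        labeled = [(obs, action, teacher.score(obs, action))
                   for (obs, action) in rollout]
        # 3) Update the student on the relabeled trajectory, e.g. with a
        #    reward-weighted policy-gradient or imitation step.
        student.update(labeled)
```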