Continual Harness: Online Adaptation for Self-Improving Foundation Agents
May 11, 2026
Authors: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli
cs.AI
Abstract
Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for the long-horizon, partially observable decision-making faced by embodied agents. We first report our Gemini Plays Pokemon (GPP) experiments. Through iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Pokemon Yellow Legacy on hard mode, and Pokemon Crystal without losing a single battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside the human-in-the-loop refinement. Continual Harness removes the human from this loop entirely: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Whereas prompt-optimization methods require episode resets, Continual Harness adapts online within a single run. On Pokemon Red and Pokemon Emerald, across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to a minimalist baseline and closes most of the gap to a hand-engineered expert harness, with gains that grow with model capability, despite using the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. Finally, we close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the self-refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
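The act/refine alternation can be pictured as a simple control loop. The sketch below is a minimal, hypothetical rendering based only on the abstract's description, not the paper's implementation: every name here (HarnessState, env.observe, env.step, model.act, model.refine) is an assumed stand-in.

```python
# A minimal sketch of Continual Harness's act/refine alternation, assuming
# hypothetical interfaces (env.observe, env.step, model.act, model.refine);
# the paper's actual harness is not specified in this abstract.

from dataclasses import dataclass, field

@dataclass
class HarnessState:
    """Everything the agent is allowed to rewrite about itself."""
    prompt: str = "You control the game. Choose the next button press."
    sub_agents: dict = field(default_factory=dict)  # name -> sub-agent spec
    skills: dict = field(default_factory=dict)      # name -> reusable routine
    memory: list = field(default_factory=list)      # long-context notes

def continual_harness(env, model, act_steps=200, total_steps=100_000):
    """Reset-free loop: act for act_steps steps, then let the model refine
    its own harness from the trajectory so far; repeat in a single run."""
    state = HarnessState()
    trajectory = []
    obs = env.observe()  # minimal raw interface: screen plus button presses
    for step in range(total_steps):
        action = model.act(state, obs)  # pick the next button press
        obs = env.step(action)          # no episode reset, ever
        trajectory.append((action, obs))
        if (step + 1) % act_steps == 0:
            # Refinement phase: the model edits its own prompt, sub-agents,
            # skills, and memory, drawing on any past trajectory data.
            state = model.refine(state, trajectory)
    return state, trajectory
```

The key design point the abstract emphasizes is that both phases run within one continuous environment rollout, which is what distinguishes this loop from prompt-optimization methods that require episode resets.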
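The closing process-reward co-learning loop can be sketched in the same spirit. Again, all interfaces (harness.collect, teacher.score, student.update) are hypothetical stand-ins, assuming a student policy, a frontier teacher that scores intermediate steps, and an environment that persists across training iterations.

```python
# A minimal sketch of the online process-reward co-learning loop described
# above. All interfaces (harness.collect, teacher.score, student.update)
# are hypothetical stand-ins, not the paper's API.

def co_learning_loop(env, student, teacher, harness,
                     iterations=100, rollout_len=512):
    """Student rollouts through the self-refining harness are relabeled
    step-by-step by a frontier teacher, then used to update the student.
    The environment carries over between iterations (no resets)."""
    for _ in range(iterations):
        # 1) The open-source student acts through the harness; game state
        #    continues from the previous iteration.
        rollout = harness.collect(env, student, steps=rollout_len)
        # 2) The frontier teacher assigns a process reward to each step,
        #    scoring intermediate progress rather than only final outcomes.
        labeled = [(obs, action, teacher.score(obs, action))
                   for (obs, action) in rollout]
        # 3) Update the student on the relabeled trajectory, e.g. with a
        #    reward-weighted policy-gradient or imitation step.
        student.update(labeled)
```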