
Continual Harness: Online Adaptation for Self-Improving Foundation Agents

May 11, 2026
Authors: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli
cs.AI

Abstract

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon decision-making under partial observability. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside the human-in-the-loop refinement. Continual Harness fully removes the human from this loop: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
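The act/refine alternation the abstract describes, where a single unbroken run interleaves button presses with rewrites of the agent's own prompt, skills, and memory, could look roughly like the Python sketch below. It is illustrative only: the environment interface (env.observe, env.step), the model calls (model.act, model.refine), and the refine_interval schedule are assumed names for this sketch, not the paper's actual API.

```python
# Illustrative sketch of the reset-free act/refine alternation described in the
# abstract. All class, method, and parameter names here are hypothetical; the
# paper's actual interfaces are not specified in the abstract.

from dataclasses import dataclass, field


@dataclass
class HarnessState:
    """Mutable components the agent is allowed to rewrite about itself."""
    prompt: str = "You control the game via button presses."
    skills: dict = field(default_factory=dict)        # named reusable routines
    memory: list = field(default_factory=list)        # long-context notes
    trajectories: list = field(default_factory=list)  # full past action history


def continual_harness(env, model, refine_interval=500, max_steps=1_000_000):
    """Alternate between acting and self-refinement within a single run.

    Unlike prompt-optimization methods, no episode reset ever occurs:
    refinement happens online, inside the same trajectory.
    """
    state = HarnessState()
    obs = env.observe()  # minimal interface: raw observation in, buttons out

    for step in range(1, max_steps + 1):
        # --- acting phase: choose a button press under the current harness ---
        action = model.act(prompt=state.prompt, obs=obs,
                           skills=state.skills, memory=state.memory)
        obs, reward = env.step(action)
        state.trajectories.append((obs, action, reward))

        # --- refinement phase: periodically rewrite the harness itself ---
        if step % refine_interval == 0:
            update = model.refine(
                prompt=state.prompt,
                skills=state.skills,
                memory=state.memory,
                past_data=state.trajectories,  # any past trajectory data
            )
            # The model proposes edits to its own scaffolding (assumed to be
            # returned as a dict of optional replacements and additions).
            state.prompt = update.get("prompt", state.prompt)
            state.skills.update(update.get("skills", {}))
            state.memory.extend(update.get("memory", []))

    return state
```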
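Similarly, the closing process-reward co-learning loop might be sketched as follows. Here harness.run, teacher.score, and student.update are hypothetical stand-ins for the components the abstract names (the self-refining harness, the frontier teacher, and the open-source student); the key property carried over from the abstract is that the environment is never reset between training iterations.

```python
def co_learning_loop(env, student, teacher, harness, iterations=100):
    """Hypothetical sketch of the online process-reward co-learning loop.

    The environment is never reset between iterations: each rollout
    continues from wherever the previous one left off.
    """
    for _ in range(iterations):
        # 1. The open-source student acts through the self-refining harness.
        rollout = harness.run(env, student, steps=2_000)  # [(state, action), ...]

        # 2. A frontier teacher relabels each step with a process reward.
        labeled = [(state, action, teacher.score(state, action))
                   for (state, action) in rollout]

        # 3. The relabeled trajectory is used to update the student online.
        student.update(labeled)
```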