
Continual Harness: Online Adaptation for Self-Improving Foundation Agents

May 11, 2026
Authors: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli
cs.AI

Abstract

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon decision-making under partial observability. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside the human-in-the-loop refinement. Continual Harness fully removes the human from this loop: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
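The act/refine alternation the abstract describes, where a single unbroken run interleaves button presses with rewrites of the agent's own prompt, skills, and memory, could look roughly like the Python sketch below. It is illustrative only: the environment interface (env.observe, env.step), the model calls (model.act, model.refine), and the refine_interval schedule are assumed names for this sketch, not the paper's actual API.

```python
# Illustrative sketch of the reset-free act/refine alternation described in the
# abstract. All class, method, and parameter names here are hypothetical; the
# paper's actual interfaces are not specified in the abstract.

from dataclasses import dataclass, field


@dataclass
class HarnessState:
    """Mutable components the agent is allowed to rewrite about itself."""
    prompt: str = "You control the game via button presses."
    skills: dict = field(default_factory=dict)        # named reusable routines
    memory: list = field(default_factory=list)        # long-context notes
    trajectories: list = field(default_factory=list)  # full past action history


def continual_harness(env, model, refine_interval=500, max_steps=1_000_000):
    """Alternate between acting and self-refinement within a single run.

    Unlike prompt-optimization methods, no episode reset ever occurs:
    refinement happens online, inside the same trajectory.
    """
    state = HarnessState()
    obs = env.observe()  # minimal interface: raw observation in, buttons out

    for step in range(1, max_steps + 1):
        # --- acting phase: choose a button press under the current harness ---
        action = model.act(prompt=state.prompt, obs=obs,
                           skills=state.skills, memory=state.memory)
        obs, reward = env.step(action)
        state.trajectories.append((obs, action, reward))

        # --- refinement phase: periodically rewrite the harness itself ---
        if step % refine_interval == 0:
            update = model.refine(
                prompt=state.prompt,
                skills=state.skills,
                memory=state.memory,
                past_data=state.trajectories,  # any past trajectory data
            )
            # The model proposes edits to its own scaffolding (assumed to be
            # returned as a dict of optional replacements and additions).
            state.prompt = update.get("prompt", state.prompt)
            state.skills.update(update.get("skills", {}))
            state.memory.extend(update.get("memory", []))

    return state
```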
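Similarly, the closing process-reward co-learning loop might be sketched as follows. Here harness.run, teacher.score, and student.update are hypothetical stand-ins for the components the abstract names (the self-refining harness, the frontier teacher, and the open-source student); the key property carried over from the abstract is that the environment is never reset between training iterations.

```python
def co_learning_loop(env, student, teacher, harness, iterations=100):
    """Hypothetical sketch of the online process-reward co-learning loop.

    The environment is never reset between iterations: each rollout
    continues from wherever the previous one left off.
    """
    for _ in range(iterations):
        # 1. The open-source student acts through the self-refining harness.
        rollout = harness.run(env, student, steps=2_000)  # [(state, action), ...]

        # 2. A frontier teacher relabels each step with a process reward.
        labeled = [(state, action, teacher.score(state, action))
                   for (state, action) in rollout]

        # 3. The relabeled trajectory is used to update the student online.
        student.update(labeled)
```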