Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Abstract

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents making long-horizon decisions under partial observability. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human from this loop entirely: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness. The gains are capability-dependent, and they are achieved despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
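To make the act/refine alternation concrete, the following is a minimal sketch of a reset-free loop and not the authors' implementation. Every name in it (`Harness`, `CounterEnv`, `run_continual`, `toy_policy`, `toy_refine`) is hypothetical, the environment is a toy counter standing in for a game, and the refinement step is a placeholder where a real system would call a model to rewrite its prompt, skills, and memory. The one property it is meant to illustrate is that the harness changes during the run while the environment state is never reset.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Harness:
    """Hypothetical harness state that the agent may rewrite during a run."""
    prompt: str = "Press buttons to make progress."
    skills: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


class CounterEnv:
    """Toy stand-in for a game: the state is a counter that is never reset."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> int:
        self.state += action
        return self.state


Trajectory = List[Tuple[int, int]]  # (action, observation) pairs


def run_continual(
    env: CounterEnv,
    policy: Callable[[Harness, int], int],
    refine: Callable[[Harness, Trajectory], None],
    total_steps: int,
    refine_every: int,
) -> Tuple[Harness, Trajectory]:
    """Alternate acting and harness refinement inside one continuous run."""
    harness = Harness()
    trajectory: Trajectory = []
    for t in range(1, total_steps + 1):
        action = policy(harness, env.state)
        observation = env.step(action)
        trajectory.append((action, observation))
        if t % refine_every == 0:
            # Refinement edits the harness in place; env.state is left untouched.
            refine(harness, trajectory)
    return harness, trajectory


def toy_policy(harness: Harness, state: int) -> int:
    # Assumption for illustration: each learned skill makes a step more productive.
    return 1 + len(harness.skills)


def toy_refine(harness: Harness, trajectory: Trajectory) -> None:
    # Placeholder for a model call that would rewrite prompt, skills, and memory.
    harness.skills.append(f"skill_{len(harness.skills)}")
    harness.memory.append(f"reached state {trajectory[-1][1]}")


if __name__ == "__main__":
    env = CounterEnv()
    harness, trajectory = run_continual(env, toy_policy, toy_refine, 6, 3)
    print(env.state, harness.skills, harness.memory)
```

In this sketch the second half of the run proceeds faster than the first because the refinement at step 3 added a skill, and progress continues from the state already reached rather than from a fresh episode.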
