Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Abstract

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, and such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one according to its own pairwise self-preference. We evaluate RHO across three diverse domains: software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis shows that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.
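The optimization round described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract gives no algorithmic detail, so every name here (`rho_round`, `select_coreset`, `solve`, `propose_updates`, `prefer`, `n_rollouts`) is hypothetical. Self-consistency is approximated as majority agreement among repeated rollouts, the pairwise self-preference step is approximated as a round-robin tournament, and self-validation is left out because its form is not specified.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Any, Callable, Sequence


def self_consistency(answers: Sequence[Any]) -> float:
    """Fraction of rollouts that agree with the most common answer (assumed proxy)."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def select_by_pairwise_preference(
    candidates: Sequence[Any], prefer: Callable[[Any, Any], bool]
) -> Any:
    """Round-robin tournament: return the candidate with the most pairwise wins.

    prefer(a, b) is a hypothetical agent judgment that returns True when a is
    preferred over b. Ties are broken in favor of the earliest candidate.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return candidates[max(range(len(candidates)), key=wins.__getitem__)]


def rho_round(
    harness: Any,
    past_tasks: Sequence[Any],
    select_coreset: Callable[[Sequence[Any]], Sequence[Any]],
    solve: Callable[[Any, Any], Any],
    propose_updates: Callable[[Any, list], Sequence[Any]],
    prefer: Callable[[Any, Any], bool],
    n_rollouts: int = 4,
) -> Any:
    """One label-free optimization round over past tasks (all callables are assumptions)."""
    coreset = list(select_coreset(past_tasks))

    # Re-solve every coreset task n_rollouts times, all jobs in parallel.
    jobs = [i for i in range(len(coreset)) for _ in range(n_rollouts)]
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda i: solve(harness, coreset[i]), jobs))

    # Group answers by task and score agreement; no ground-truth labels are used.
    grouped = defaultdict(list)
    for i, answer in zip(jobs, answers):
        grouped[i].append(answer)
    report = [(coreset[i], self_consistency(grouped[i])) for i in range(len(coreset))]

    # Generate candidate harness updates, then pick one by pairwise self-preference.
    candidates = list(propose_updates(harness, report))
    if not candidates:
        return harness
    return select_by_pairwise_preference(candidates, prefer)
```

In an actual agent, `solve`, `propose_updates`, and `prefer` would each be model calls. They are passed in as plain callables here so the control flow can be exercised with simple stubs.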