Adaptive Auto-Harness: Sustained Self-Improvement for Deploying Agentic Systems on Open-Ended Task Streams

Abstract

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. Under these conditions, a single harness that is updated repeatedly and densely becomes brittle: accuracy peaks early and then declines. This motivates sustained harness construction with task-wise adaptation. We introduce Adaptive Auto-Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into an evolution loss and an adaptation loss. The system addresses these losses with a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks for cases where the history lacks the needed signal. Across prediction-market, security-competition, and event-forecasting streams, Adaptive Auto-Harness outperforms five existing auto-harness baselines, and ablations attribute the gains to better construction, routing, or targeted human steering. Code is available at https://github.com/A-EVO-Lab/AdaptiveHarness.
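The decomposition of the oracle gap can be illustrated with a small numeric sketch. The abstract does not give the formal definitions, so the following is one plausible reading, stated as an assumption: for a single task, the evolution loss is taken as the shortfall of the best harness constructed so far relative to the oracle harness, and the adaptation loss as the shortfall of the harness actually chosen by solve-time routing relative to that best constructed harness. The names `GapDecomposition`, `decompose_gap`, and the harness labels are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapDecomposition:
    """Per-task gap to the oracle harness, split into two parts (assumed definitions)."""
    evolution_loss: float   # oracle score minus best constructed harness score
    adaptation_loss: float  # best constructed harness score minus routed harness score

    @property
    def total_gap(self) -> float:
        # The two terms telescope to: oracle score minus routed harness score.
        return self.evolution_loss + self.adaptation_loss


def decompose_gap(oracle_score: float,
                  candidate_scores: dict[str, float],
                  routed_harness: str) -> GapDecomposition:
    """Split the oracle gap for one task.

    candidate_scores maps each constructed harness (hypothetical names) to its
    score on the task; routed_harness is the one the router selected.
    """
    best_available = max(candidate_scores.values())
    routed_score = candidate_scores[routed_harness]
    return GapDecomposition(
        evolution_loss=oracle_score - best_available,
        adaptation_loss=best_available - routed_score,
    )


# Toy example with made-up scores: the router picks the weaker of two harnesses.
toy_scores = {"forecasting_v1": 0.5, "forecasting_v2": 0.75}
result = decompose_gap(oracle_score=1.0,
                       candidate_scores=toy_scores,
                       routed_harness="forecasting_v1")
print(result.evolution_loss, result.adaptation_loss, result.total_gap)
```

Under this reading, better harness construction shrinks the first term and better routing shrinks the second, which matches how the abstract assigns the evolver and the harness tree to the two losses.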