HarnessForge: 적응형 에이전트 시스템을 위한 하네스와 정책의 공동 진화

초록

LLM 에이전트는 서로 다른 실행 패러다임을 요구하는 이질적인 작업 체계 전반에서 작동해야 하는 경우가 점점 더 많아지고 있다. 이는 고정된 에이전트 시스템에 도전 과제를 제기하며, 개별 구성 요소 업데이트를 넘어서는 시스템 수준의 메타 적응을 촉진한다. 기존 연구에서는 외부 하네스를 적응시키거나 기반 추론 정책을 학습시켰지만, 전체 시스템 적응은 여전히 충분히 특성화되지 않았다. 구조와 실행 사이의 적응 공간은 거의 명시적으로 드러나지 않으며, 외부 하네스와 내부 추론기 간의 호환성은 공동으로 최적화되지 않는다. 본 논문에서는 LLM 에이전트 시스템을 진화시키기 위한 메타 적응형 프레임워크인 HarnessForge를 제안한다. HarnessForge는 에이전트 시스템을 하네스-정책 쌍으로 공식화하여, 하네스 수준의 실행 구조와 정책 수준의 추론 행동을 분리하는 안정적인 적응 공간을 정의한다. 그런 다음 결함 기반 하네스 조정과 하네스 조건 정책 정렬을 통해 하네스-정책 공진화를 수행한다. 다양한 도메인의 5개 벤치마크에 걸친 실험 결과, HarnessForge는 Qwen3-4B와 Qwen3-8B 백본 모두에서 일관된 성능 향상을 보였으며, 하네스 전용 및 정책 전용 베이스라인보다 최대 12.0% 더 우수한 성능을 기록하고 유리한 롤아웃 효율성 절충을 달성했다. 이는 하네스-정책 공진화가 효과적이며, 하네스와 추론 정책 간의 실행 가능한 호환성이 에이전트 시스템 적응에 필수적임을 입증한다. 코드는 https://github.com/mingju-c/HarnessForge에서 확인할 수 있다.

English

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted external harness or trained underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness--policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness--policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0\% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs, demonstrating that harness--policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.