Harness-Updating Is Not Harness-Benefit: Disentangling the Evolution Capabilities of Self-Evolving LLM Agents

Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses (prompts, skills, memories, and tools) that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base task-solving capability predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which models actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful, persistent harness updates from execution evidence; and (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis yields two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains, and even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier models. We trace the low gains at the weak tier to two failure modes: weak-tier models may fail to activate the relevant harness artifacts, or they may activate them but fail to follow them faithfully. These findings suggest investing the capability budget in the task-solving agent rather than in the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.
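To make the distinction between the two capabilities concrete, the sketch below shows one way they could be separated in an evolver-by-solver evaluation grid. It is an illustrative assumption, not the paper's actual metric or code: the function names, the averaging scheme, and every number are hypothetical placeholders. Harness-updating is read off per evolver (how much gain its updates deliver, averaged over solvers), and harness-benefit per solver (how much gain it receives, averaged over evolvers).

```python
from statistics import mean


def gains(baseline, evolved):
    """gain[evolver][solver] = score with the evolved harness minus the solver's baseline."""
    return {
        evolver: {solver: score - baseline[solver] for solver, score in per_solver.items()}
        for evolver, per_solver in evolved.items()
    }


def harness_updating(gain):
    """Per evolver: mean gain its harness updates deliver across all solvers."""
    return {evolver: mean(per_solver.values()) for evolver, per_solver in gain.items()}


def harness_benefit(gain):
    """Per solver: mean gain it receives across harnesses written by all evolvers."""
    solvers = next(iter(gain.values())).keys()
    return {s: mean(per_solver[s] for per_solver in gain.values()) for s in solvers}


# Hypothetical placeholder scores (NOT results from the paper), in percentage points.
baseline = {"solver_weak": 10, "solver_mid": 40, "solver_strong": 70}
evolved = {
    "evolver_small": {"solver_weak": 12, "solver_mid": 52, "solver_strong": 76},
    "evolver_large": {"solver_weak": 12, "solver_mid": 54, "solver_strong": 76},
}

g = gains(baseline, evolved)
updating = harness_updating(g)  # one number per evolver
benefit = harness_benefit(g)    # one number per solver
```

Under this framing, "flat harness-updating" would show up as similar values across evolvers in `updating`, while "non-monotonic harness-benefit" would show up as the middle solver having the largest value in `benefit`.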