HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.