From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Abstract

This paper analyzes system scaling, not only model scaling, as the next major bottleneck in agentic AI: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness, that is, treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, yet evaluation remains largely model-centric. It often reduces an agent to final-task success and treats memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, the memory substrate, the context constructor, the skill-routing layer, the orchestration loop, and the verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks (context governance, trustworthy memory, and dynamic skill routing) together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws (https://github.com/SafeRL-Lab/cheetahclaws), a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.
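As a rough illustration of how the harness components named above might fit together, the following minimal Python sketch wires a model, a memory substrate, a context constructor, a skill router, and a verification check into a single orchestration step. Every class, field, and method name here (Harness, build_context, route, step, context_budget) is a hypothetical choice made for illustration; none of it is taken from CheetahClaws, Claude Code, or OpenClaw, and the "skill:argument" action format is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Hypothetical harness; all names are illustrative, not a real API."""
    model: Callable[[str], str]               # foundation model: context -> "skill:argument"
    skills: Dict[str, Callable[[str], str]]   # registry consulted by the skill-routing layer
    verify: Callable[[str], bool]             # verification-and-governance check
    memory: List[str] = field(default_factory=list)  # memory substrate
    context_budget: int = 3                   # max memory entries placed in context

    def build_context(self, task: str) -> str:
        # Context constructor: keep only the most recent entries within the budget.
        recent = self.memory[-self.context_budget:] if self.context_budget > 0 else []
        return "\n".join(recent + [task])

    def route(self, action: str) -> Callable[[str], str]:
        # Skill routing: the text before ':' names the skill; unknown names get a fallback.
        name = action.split(":", 1)[0]
        return self.skills.get(name, lambda arg: "unknown skill")

    def step(self, task: str) -> str:
        # One orchestration step: context -> model -> route -> verify -> memory write.
        action = self.model(self.build_context(task))
        arg = action.split(":", 1)[1] if ":" in action else ""
        result = self.route(action)(arg)
        if self.verify(result):
            self.memory.append(f"{task} -> {result}")  # only verified results persist
            return result
        return "rejected"


# Usage with a stub "model" that always asks for the hypothetical 'upper' skill.
harness = Harness(
    model=lambda ctx: "upper:" + ctx.splitlines()[-1],
    skills={"upper": str.upper},
    verify=lambda out: len(out) < 100,
)
print(harness.step("hello"))  # prints HELLO
```

The point of the sketch is only that a result reaches memory after passing verification, and that what the model sees is decided by the context constructor rather than by the model itself; a real harness would replace each stub with a far richer component.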