Benchmarks Alone Are Not Enough: RAMP for Runtime Assessment of Agentic Models in Production Systems

Abstract

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may be a poor indicator of practical capability in realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We therefore present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented, multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates decline progressively across the serial workflow, dropping from 100% in the initial stage to only 20% in the final stage, and none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest that RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.