Beyond Static Leaderboards: Predictive Validity for LLM Agent Evaluation

Abstract

Agent benchmarks are growing fast, but no single benchmark covers more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep dive to date into a single MCP-based industrial-agent benchmark: fourteen parallel implementation studies covering new asset classes (including a multimodal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, defined as the correlation between in-sample and out-of-sample rank, rather than by in-sample mean. We also report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions that HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm it. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
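As an illustration of the predictive-validity quantity described above, the following minimal sketch computes a rank correlation between in-sample and out-of-sample scores for a set of agent configurations. It assumes Spearman's rank correlation as the correlation measure; the abstract does not specify which coefficient is used. The function names and the scores are hypothetical and are not drawn from any of the studies.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores in descending order (1 = best); ties share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of configurations tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (vx * vy) ** 0.5


def predictive_validity(in_sample, out_of_sample):
    """Spearman correlation between in-sample and out-of-sample rankings.

    Assumption: Spearman's rho stands in for the unspecified
    correlation between in-sample and out-of-sample rank.
    """
    if len(in_sample) != len(out_of_sample) or len(in_sample) < 2:
        raise ValueError("need two equal-length score lists with >= 2 entries")
    return pearson(average_ranks(in_sample), average_ranks(out_of_sample))


# Hypothetical scores for four agent configurations (illustrative only).
in_sample_scores = [0.81, 0.78, 0.74, 0.69]
out_of_sample_scores = [0.52, 0.61, 0.58, 0.40]

rho = predictive_validity(in_sample_scores, out_of_sample_scores)
print(f"predictive validity (Spearman rho): {rho:.2f}")  # 0.40
```

In this toy case the configuration with the best in-sample mean drops to third place out of sample, so the rank correlation is only moderate even though the in-sample leaderboard looks well separated. A leaderboard with high predictive validity would instead yield a value close to 1.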