Benchmarks Are Not Enough: RAMP for Runtime Assessment of Agentic Models in Production Systems

Abstract

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may be a poor indicator of practical capability in realistic runtime environments that involve long execution chains, tool interactions, dependency management, and iterative feedback loops. We therefore present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented, multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments of 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, and none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest that RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.