Benchmarks Are Not Enough: RAMP, Runtime Assessment of Agentic Models in Production Systems

Abstract

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.
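
To illustrate how a per-stage completion rate can stay above zero in the final stage while no model completes the whole pipeline, the following is a minimal sketch. It assumes that staged recovery lets a model attempt stage i+1 from a reference checkpoint even when it failed stage i; that reading is an assumption, not a detail confirmed by the abstract. The function names and the toy data are hypothetical and are not RAMP's actual interface or results.

```python
from typing import Dict, List

# Hypothetical toy data: one list of per-stage pass/fail flags per model.
# These values are illustrative only and are NOT results from the paper.
TOY_RUNS: Dict[str, List[bool]] = {
    "model_a": [True, True, False],
    "model_b": [True, False, True],   # fails stage 2, recovers, passes stage 3
    "model_c": [True, False, False],
    "model_d": [True, True, False],
    "model_e": [True, False, False],
}


def stage_completion_rates(runs: Dict[str, List[bool]]) -> List[float]:
    """Fraction of models passing each stage, counted stage by stage.

    Assumes every model attempts every stage (staged recovery), so the
    denominator is the total number of models at each stage.
    """
    n_models = len(runs)
    n_stages = len(next(iter(runs.values())))
    return [
        sum(run[stage] for run in runs.values()) / n_models
        for stage in range(n_stages)
    ]


def end_to_end_successes(runs: Dict[str, List[bool]]) -> int:
    """Number of models that pass every stage without needing recovery."""
    return sum(all(run) for run in runs.values())


if __name__ == "__main__":
    print(stage_completion_rates(TOY_RUNS))
    print(end_to_end_successes(TOY_RUNS))
```

Under these assumptions, the two metrics answer different questions: the per-stage rate measures capability on each stage in isolation, while the end-to-end count measures whether any single run survives the full serial dependency chain.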