Benchmarks Are Not Enough: RAMP for Runtime Assessment of Agentic Models in Production Systems

Abstract

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability in realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We therefore present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented, multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, and none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest that RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.
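As an illustration of how per-stage completion rates relate to end-to-end completion in a serial workflow, the following minimal sketch may help. It is not RAMP's implementation: the function names, the data layout, and the toy results are all hypothetical. It assumes that staged recovery lets a model attempt a later stage even after failing an earlier one, which is one way a nonzero final-stage rate can coexist with no model completing the whole pipeline.

```python
# Hypothetical sketch, not RAMP's actual metric code.
# Each model maps to an ordered list of per-stage pass/fail outcomes.
# Assumption: with staged recovery, a model that fails stage k can still
# attempt stage k+1 (e.g. starting from a reference artifact).


def stage_completion_rates(runs):
    """Return the fraction of models that pass each stage, in stage order."""
    n_models = len(runs)
    n_stages = len(next(iter(runs.values())))
    rates = []
    for stage in range(n_stages):
        passed = sum(1 for outcomes in runs.values() if outcomes[stage])
        rates.append(passed / n_models)
    return rates


def completes_pipeline(outcomes):
    """A model completes the pipeline only if it passes every stage."""
    return all(outcomes)


# Made-up toy data: 5 models, 3 serial stages (not results from the paper).
toy_runs = {
    "model_a": [True, True, False],
    "model_b": [True, False, True],   # passes the last stage only after recovery
    "model_c": [True, True, False],
    "model_d": [True, False, False],
    "model_e": [True, False, False],
}

rates = stage_completion_rates(toy_runs)
full_completions = [m for m, o in toy_runs.items() if completes_pipeline(o)]

print("per-stage completion rates:", rates)
print("models completing the full pipeline:", full_completions)
```

In this toy setup the per-stage rate falls from 1.0 to 0.2, yet the list of full completions is empty, because the one model that passes the final stage failed an earlier one. Reporting both quantities separates stage-level capability from end-to-end reliability.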