ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Abstract

We introduce ISO-Bench, a benchmark that evaluates coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. In each task, an agent receives a codebase and a description of a bottleneck and must produce an optimization patch, which is evaluated against expert human solutions. We curated 54 tasks from merged pull requests with measurable performance improvements. Existing benchmarks rely heavily on runtime-based metrics, but such metrics can be gamed: a patch can pass the tests without capturing the actual intent of the code change. We therefore combine hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for a complete evaluation. Evaluating both closed-source and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to execute a working solution. We also show that agents built on identical underlying models differ substantially in performance, suggesting that scaffolding is as important as the model.
