ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Abstract

We introduce ISO-Bench, a benchmark that tests coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task gives an agent a codebase and a bottleneck description; the agent must produce an optimization patch, which is evaluated against the expert human solution. We curated 54 tasks from merged pull requests with measurable performance improvements. Existing benchmarks rely heavily on runtime-based metrics, but such metrics can be gamed: a patch may pass the tests without capturing the actual intent of the code change. We therefore combine hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for a complete evaluation. Evaluating both closed-source and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to execute a working solution. We also show that agents built on identical underlying models differ substantially in performance, suggesting that scaffolding is as important as the model.
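To illustrate why hard and soft metrics can disagree, the following is a minimal sketch of how the two signals might be combined into a single label per patch. It is not the benchmark's actual scoring code: the `PatchResult` fields, the `classify` function, the label names, and the 5% `tolerance` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Outcome of evaluating one agent patch (all fields are hypothetical)."""
    tests_pass: bool          # hard: the patched code runs and passes its checks
    agent_speedup: float      # hard: baseline time / time with the agent patch
    human_speedup: float      # hard: baseline time / time with the expert patch
    targets_bottleneck: bool  # soft: an LLM judge says the patch addresses the described bottleneck


def classify(result: PatchResult, tolerance: float = 0.05) -> str:
    """Combine the hard and soft signals into one label.

    `tolerance` is an assumed slack: the agent patch counts as matching the
    expert if its speedup is within 5% of the human speedup.
    """
    hard_ok = result.tests_pass and (
        result.agent_speedup >= result.human_speedup * (1.0 - tolerance)
    )
    if hard_ok and result.targets_bottleneck:
        return "true_success"
    if hard_ok:
        # A runtime-only benchmark would count this as a success.
        return "fast_but_off_target"
    if result.targets_bottleneck:
        # Correct diagnosis, but no working (or fast enough) solution.
        return "right_target_failed_execution"
    return "failure"


if __name__ == "__main__":
    print(classify(PatchResult(True, 1.30, 1.25, True)))
    print(classify(PatchResult(True, 1.30, 1.25, False)))
    print(classify(PatchResult(False, 1.00, 1.25, True)))
```

Under these assumptions, the second case is what a hard-only evaluation would miss, and the third corresponds to an agent that identifies the correct bottleneck but fails to deliver a working patch.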