

ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

February 23, 2026
作者: Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra
cs.AI

Abstract

We introduce ISO-Bench, a benchmark that evaluates coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and a bottleneck description, and the agent must produce an optimization patch that is evaluated against the expert human solution. We curated 54 tasks from merged pull requests with measurable performance improvements. Existing benchmarks rely heavily on runtime-based metrics, which can be gamed: a patch can pass the tests without capturing the actual intent of the code change. We therefore combine hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for complete evaluation. Evaluating both closed- and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to implement a working solution. We also show that agents built on identical underlying models differ substantially in performance, suggesting that scaffolding is as important as the model itself.
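The combination of hard and soft metrics described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual evaluation harness: the function names, thresholds, and judge-score scale are all assumptions.

```python
# Hypothetical sketch of combining a hard (execution-based) metric with a
# soft (LLM-judge) metric, as the abstract describes. All names and
# thresholds here are illustrative assumptions, not the paper's code.

def hard_metric(tests_pass: bool, speedup: float, min_speedup: float = 1.0) -> bool:
    """Execution-based check: the patch must pass the test suite and
    deliver a measurable speedup over the baseline."""
    return tests_pass and speedup > min_speedup

def soft_metric(judge_score: float, threshold: float = 0.5) -> bool:
    """LLM-based check: an LLM judge scores (0..1) how well the patch
    captures the intent of the expert human change."""
    return judge_score >= threshold

def evaluate(tests_pass: bool, speedup: float, judge_score: float) -> dict:
    """Combine both metrics: a patch that games the runtime tests but
    diverges from the intent of the human solution is not a full success."""
    hard = hard_metric(tests_pass, speedup)
    soft = soft_metric(judge_score)
    return {"hard": hard, "soft": soft, "success": hard and soft}

print(evaluate(True, 1.3, 0.8))   # fast and faithful to the expert patch
print(evaluate(True, 1.3, 0.2))   # fast, but misses the intent of the change
```

The key design point is the conjunction: neither metric alone suffices, since runtime-only checks can be gamed and judge-only checks never verify that the patch actually runs faster.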