ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Abstract

We introduce ISO-Bench, a benchmark that tests the capabilities of coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task gives an agent a codebase and a description of a bottleneck; the agent must produce an optimization patch, which is evaluated against the expert human solution. We curated 54 tasks from merged pull requests with measurable performance improvements. Existing benchmarks rely heavily on runtime-based metrics, but such metrics can be gamed: a patch can pass the tests without capturing the actual intent of the code change. We therefore combine hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for a complete evaluation. Evaluating both closed-source and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to execute a working solution. We also show that agents built on identical underlying models differ substantially, suggesting that scaffolding is as important as the model.
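The claim that hard and soft metrics are both needed can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from ISO-Bench itself: the PatchResult record, the 1.05 speedup threshold, the outcome labels, and the idea of reducing each metric to a single pass/fail value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchResult:
    """Hypothetical record for one agent patch on one task."""
    speedup: float             # hard metric: baseline time / patched time
    targets_bottleneck: bool   # soft metric: verdict of an LLM judge


def classify(result: PatchResult, min_speedup: float = 1.05) -> str:
    """Combine the hard and soft verdicts into one of four outcomes.

    The threshold and the labels are illustrative assumptions only.
    """
    hard_pass = result.speedup >= min_speedup
    soft_pass = result.targets_bottleneck
    if hard_pass and soft_pass:
        return "success"       # faster, and addresses the described bottleneck
    if hard_pass:
        return "metric_only"   # faster, but may be gaming the measurement
    if soft_pass:
        return "intent_only"   # right bottleneck, no working speedup
    return "failure"
```

Under these assumptions, a runtime-only evaluation could not separate "success" from "metric_only", and an LLM-only evaluation could not separate "success" from "intent_only", the case the abstract describes as identifying the correct bottleneck without executing a working solution.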
