BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
September 27, 2025
Authors: Zehua Zhang, Ati Priya Bajaj, Divij Handa, Siyu Liu, Arvind S Raj, Hongkai Chen, Hulin Wang, Yibo Liu, Zion Leonahenahe Basque, Souradip Nath, Vishal Juneja, Nikhil Chapre, Yan Shoshitaishvili, Adam Doupé, Chitta Baral, Ruoyu Wang
cs.AI
Abstract
Automatically compiling open-source software (OSS) projects is a vital,
labor-intensive, and complex task, which makes it a good challenge for LLM
Agents. Existing methods rely on manually curated rules and workflows, which
cannot adapt to OSS that requires customized configuration or environment
setup. Recent attempts using Large Language Models (LLMs) evaluate only a
selective subset of highly rated OSS, a practice that underestimates the
realistic challenges of OSS compilation. In practice, compilation instructions
are often absent, dependencies are undocumented, and successful builds may even
require patching source files or modifying build scripts. We propose a more
challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more
diverse in quality, scale, and characteristics. Furthermore, we propose a
strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with an
enhanced build-instruction retrieval module that achieves state-of-the-art
performance on BUILD-BENCH and is adaptable to heterogeneous OSS
characteristics. We also provide a detailed analysis of different
compilation-method design choices and their influence on the whole task,
offering insights to guide future advances. We believe performance on
BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as
a complex software engineering task, and, as such, our benchmark will spur
innovation with a significant impact on downstream applications in the fields
of software development and software security.
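To make the heterogeneity the abstract describes concrete: before an agent can attempt a build at all, it must infer which build system a checkout uses, since real-world OSS rarely documents this. The following is a minimal illustrative sketch of such a detection step; the function name and marker list are our own assumptions and are not taken from OSS-BUILD-AGENT.

```python
import os

def detect_build_system(src_dir: str) -> str:
    """Guess the build system of an OSS checkout from marker files.

    Hypothetical helper for illustration only; real agents must also
    handle projects with multiple or custom build scripts.
    """
    markers = [
        ("CMakeLists.txt", "cmake"),
        ("configure", "autotools"),
        ("configure.ac", "autotools"),
        ("meson.build", "meson"),
        ("Makefile", "make"),
    ]
    # Return the first build system whose marker file is present.
    for fname, system in markers:
        if os.path.exists(os.path.join(src_dir, fname)):
            return system
    return "unknown"
```

A rule-based check like this is exactly what prior curated-workflow approaches hard-code; the abstract's point is that an agent must go further when no marker file, documentation, or dependency list exists.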