BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Abstract

Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) evaluated selectively on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and a successful build may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS projects that are more diverse in quality, scale, and characteristics. Furthermore, we propose OSS-BUILD-AGENT, a strong LLM-based baseline agent with an enhanced build instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and adapts to heterogeneous OSS characteristics. We also provide a detailed analysis of different design choices in compilation methods and their influence on the overall task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering task, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.
