BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Abstract

Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) evaluate selectively on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS projects that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with an enhanced build instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and adapts to heterogeneous OSS characteristics. We also provide a detailed analysis of different design choices in compilation methods and their influence on the overall task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering task, and that, as such, our benchmark will spur innovation with a significant impact on downstream applications in software development and software security.
