LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Abstract

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
