BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes, resolution scope and knowledge scope, using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
