

BeyondSWE: Can Current Code Agents Survive Beyond Single-Repo Bug Fixing?

March 3, 2026
Authors: Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
cs.AI

Abstract
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
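To make the abstract's headline numbers concrete, the sketch below shows one plausible way to aggregate per-instance outcomes into per-setting and overall success rates across the four settings named above. The record layout, setting names as dictionary keys, and the `success_rates` helper are illustrative assumptions, not the paper's actual evaluation harness.

```python
from collections import defaultdict

# Hypothetical evaluation records: (setting, resolved) pairs.
# The four settings mirror those named in the abstract; the data
# values here are placeholders, not reported results.
RESULTS = [
    ("cross-repository reasoning", True),
    ("cross-repository reasoning", False),
    ("domain-specialized problem solving", True),
    ("dependency-driven migration", False),
    ("full-repository generation", False),
    ("full-repository generation", True),
]

def success_rates(results):
    """Return (per-setting rates, overall rate) as fractions in [0, 1]."""
    totals, solved = defaultdict(int), defaultdict(int)
    for setting, resolved in results:
        totals[setting] += 1
        solved[setting] += int(resolved)
    per_setting = {s: solved[s] / totals[s] for s in totals}
    overall = sum(solved.values()) / sum(totals.values())
    return per_setting, overall

per_setting, overall = success_rates(RESULTS)
print(f"overall success: {overall:.0%}")
for setting, rate in sorted(per_setting.items()):
    print(f"  {setting}: {rate:.0%}")
```

Reporting both a per-setting breakdown and an overall rate matches the abstract's observation that no single model performs consistently across task types: an aggregate score alone would hide exactly that variance.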