
**BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?**

March 3, 2026
作者: Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
cs.AI

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.