

ContextBench: A Benchmark for Context Retrieval in Coding Agents

February 5, 2026
Authors: Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye
cs.AI

Abstract

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval ("The Bitter Lesson" of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks.
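The context recall and precision metrics described above admit a simple set-based reading. The snippet below is a minimal illustrative sketch, not the paper's implementation: it assumes gold and retrieved contexts are represented as sets of file paths for a single task, whereas ContextBench's actual granularity and trajectory tracking may differ.

```python
# A minimal sketch (not the paper's implementation) of context recall and
# precision against human-annotated gold contexts. Both arguments are assumed
# to be sets of code locations (e.g., file paths) for a single task.

def context_recall(retrieved_context: set[str], gold_context: set[str]) -> float:
    """Fraction of gold context locations the agent actually retrieved."""
    if not gold_context:
        return 1.0
    return len(retrieved_context & gold_context) / len(gold_context)


def context_precision(retrieved_context: set[str], gold_context: set[str]) -> float:
    """Fraction of retrieved locations that belong to the gold context."""
    if not retrieved_context:
        return 0.0
    return len(retrieved_context & gold_context) / len(retrieved_context)


# Example: an agent that explores broadly scores high recall but low precision,
# mirroring the recall-over-precision tendency reported in the abstract.
gold = {"src/core/parser.py", "src/core/lexer.py"}
retrieved = {"src/core/parser.py", "src/core/lexer.py", "tests/test_cli.py", "README.md"}
print(context_recall(retrieved, gold))     # 1.0
print(context_precision(retrieved, gold))  # 0.5
```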