Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
April 19, 2026
Authors: Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia
cs.AI
Abstract
Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure the fraction of edits that are necessary and the fraction of bugs that are resolved, respectively. We release two evaluation benchmarks: PDB-Single-Hard for single-line bugs, and PDB-Multi for multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
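To make the two metrics concrete, here is a minimal illustrative sketch (not code from the paper) that computes edit-level precision and bug-level recall, under the simplifying assumption that both the model's edits and the injected bugs can be identified by line number; the function and variable names are our own.

```python
def edit_level_precision(model_edits: set[int], necessary_edits: set[int]) -> float:
    """Fraction of the model's edited lines that actually needed changing."""
    if not model_edits:
        return 0.0
    return len(model_edits & necessary_edits) / len(model_edits)


def bug_level_recall(resolved_bugs: set[int], injected_bugs: set[int]) -> float:
    """Fraction of the injected bugs that the model's patch resolves."""
    if not injected_bugs:
        return 1.0
    return len(resolved_bugs & injected_bugs) / len(injected_bugs)


# Example: the model edits lines {3, 7, 12, 20}, but only lines {3, 7}
# contained bugs; it resolves 1 of the 2 injected bugs.
precision = edit_level_precision({3, 7, 12, 20}, {3, 7})  # 0.5
recall = bug_level_recall({3}, {3, 7})                    # 0.5
```

A patch can thus pass all unit tests (high recall) while still scoring low precision if it rewrites many correct lines, which is exactly the over-editing behavior the benchmark is designed to expose.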