

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

October 2, 2024
Authors: Yuling Shi, Songsong Wang, Chengcheng Wan, Xiaodong Gu
cs.AI

Abstract

While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked by subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
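The bottom-up repair loop and the execution-tracing idea described in the abstract can be sketched as follows. This is a minimal illustrative assumption, not the authors' implementation: the `FuncNode` structure, the `passes`/`fix` callbacks, and the `trace_variables` helper (which uses a real Python tracer where the paper substitutes an LLM-simulated executor) are all hypothetical names introduced here.

```python
import sys
from dataclasses import dataclass, field

@dataclass
class FuncNode:
    """One subfunction in the hierarchical decomposition tree (assumed shape)."""
    name: str
    code: str
    children: list = field(default_factory=list)

def trace_variables(func, *args):
    """Run `func` and record its local-variable states line by line.

    Uses a real Python tracer here; the paper instead has an LLM
    *simulate* execution and predict these intermediate states."""
    states = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            states.append(dict(frame.f_locals))  # snapshot locals before the line runs
        return tracer
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, states

def debug_bottom_up(node, passes, fix):
    """Repair leaf subfunctions before their parents (bottom-up traversal)."""
    for child in node.children:
        debug_bottom_up(child, passes, fix)
    while not passes(node):         # re-test after every repair attempt
        node.code = fix(node.code)  # stand-in for an LLM repair call

def sum_list(xs):
    """Toy subfunction used to demonstrate variable tracing."""
    total = 0
    for x in xs:
        total += x
    return total
```

Tracing `sum_list([1, 2, 3])` yields the final result together with the sequence of intermediate `total` values, the kind of per-line state the paper's simulated executor uses to pinpoint which subfunction first goes wrong.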

