Towards a Neural Debugger for Python

Abstract

Training large language models (LLMs) on Python execution traces grounds them in code execution and enables line-by-line execution prediction for whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at chosen breakpoints and step through only the relevant portions while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers, obtained by fine-tuning large LLMs or by pre-training smaller models from scratch, can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes a first step towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
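
For readers unfamiliar with execution traces, the following is a minimal sketch of what a conventional (non-neural) Python tracer records, using the standard `sys.settrace` hook. It is an illustration only and not the method or data format of this work: the helper `trace_lines`, the example function `running_sum`, and the line-offset encoding of breakpoints are all hypothetical names and choices made for this example.

```python
import sys


def trace_lines(func, *args, breakpoints=None):
    """Run func(*args) and record (line offset, locals) before each executed line.

    Offsets are counted from the function's `def` line. If `breakpoints` is a
    set of offsets, only those lines are recorded, loosely mimicking a debugger
    that continues until it reaches a breakpoint. (Hypothetical helper.)
    """
    code = func.__code__
    events = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace nested calls, similar to "step over"
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            if breakpoints is None or offset in breakpoints:
                # Copy the locals so later mutations do not alter the record.
                events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, events


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    result, events = trace_lines(running_sum, 3)
    for offset, local_vars in events:
        print(offset, local_vars)
    print("result:", result)
```

Passing `breakpoints={3}` restricts the record to the loop body line, so each entry shows the program state just before `total += i` runs. A neural debugger in the sense of the abstract would predict such states from the source code and a debugger action rather than by executing the program.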