Towards a Neural Debugger for Python

Abstract

Training large language models (LLMs) on Python execution traces grounds them in code execution and enables them to predict the line-by-line execution of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at chosen breakpoints and step through only the relevant portions while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers, obtained either by fine-tuning large LLMs or by pre-training smaller models from scratch, can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps toward future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
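To make the notions of an execution trace and a breakpoint concrete, the following minimal sketch records program states with Python's standard `sys.settrace` hook. It illustrates what a conventional debugger observes; it is not the paper's data pipeline or model interface. The helper `trace_lines`, its `breakpoints` parameter, and the example function `total` are hypothetical names introduced here for illustration only.

```python
import sys


def trace_lines(func, *args, breakpoints=None):
    """Run func(*args) and record (line_offset, locals) just before each line runs.

    line_offset is relative to the `def` line of func. If `breakpoints` is a set
    of offsets, only states at those lines are kept (a crude breakpoint filter).
    Hypothetical helper for illustration; not taken from the paper.
    """
    code = func.__code__
    states = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace callees, similar in spirit to "step over"
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            if breakpoints is None or offset in breakpoints:
                states.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, states


def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc


# Full line-by-line trace, as a neural interpreter would have to predict it.
result, full_trace = trace_lines(total, [1, 2])

# Only the state at the `return acc` line (offset 4), as with a breakpoint.
_, at_return = trace_lines(total, [1, 2], breakpoints={4})
```

In this framing, forward execution corresponds to predicting the state recorded at the breakpoint from the initial arguments, and inverse execution to inferring arguments consistent with a given recorded state.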
