Python向けニューラルデバッガーの開発に向けて

要旨

大規模言語モデル（LLM）をPythonの実行トレースで学習させることで、コード実行の基盤を構築し、Pythonプログラム全体の行単位での実行予測を可能とする。これは実質的にニューラルインタプリタ（FAIR CodeGen Team et al., 2025）への転換を意味する。しかし、開発者はプログラムを段階的に実行することは稀であり、代わりにデバッガを使用して特定のブレークポイントで実行を停止し、プログラム変数を検査または修正しながら関連部分のみをステップ実行する。既存のニューラルインタプリタ手法には、このような対話的制御が欠けている。この限界に対処するため、我々はニューラルデバッガを提案する。これは従来のデバッガをエミュレートする言語モデルであり、ステップイン、ステップオーバー、ステップアウトといった操作や、特定のソース行へのブレークポイント設定をサポートする。大規模LLMのファインチューニング、または小規模モデルのスクラッチからの事前学習により獲得したニューラルデバッガが、デバッガ操作を条件として、順方向実行（将来の状態や出力の予測）と逆方向実行（過去の状態や入力の推論）の両方を確実にモデル化できることを示す。CruxEvalによる評価では、出力予測タスクと入力予測タスクの両方で高い性能を達成し、頑健な条件付き実行モデリングを実証した。本研究は、ニューラルデバッガが模擬デバッグ環境における世界モデルとして機能し、実行フィードバックを提供したり、エージェントが実際のデバッグツールと対話することを可能にする、将来のエージェント型コーディングシステムへの第一歩である。この能力は、より強力なコード生成、プログラム理解、自動デバッグの基盤を築くものである。

English

Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers -- obtained via fine-tuning large LLMs or pre-training smaller models from scratch -- can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.

Python向けニューラルデバッガーの開発に向けて

Towards a Neural Debugger for Python

要旨

Support