CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs

Abstract

Current paradigms for code verification rely heavily on external mechanisms, such as execution-based unit tests or auxiliary LLM judges, which are often labor-intensive or limited by the judging model's own capabilities. This raises a fundamental yet unexplored question: can an LLM's functional correctness be assessed purely from its internal computational structure? Our primary objective is to investigate whether the model's neural dynamics during code generation encode internally decodable signals that predict logical validity. Inspired by mechanistic interpretability, we propose treating code verification as a mechanistic diagnostic task, mapping the model's explicit algorithmic trajectory into line-level attribution graphs. By decomposing complex residual flows, we aim to identify the structural signatures that distinguish sound reasoning from logical failure within the model's internal circuits. Analysis across Python, C++, and Java confirms that intrinsic correctness signals are robust across diverse syntaxes. Topological features extracted from these internal graphs predict correctness more reliably than surface heuristics and enable targeted causal interventions to fix erroneous logic. These findings establish internal introspection as a decodable property for verifying generated code. Our code is available at https://github.com/bruno686/CodeCircuit.
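As a rough illustration of what "topological features" of an attribution graph could look like, the sketch below computes a few generic graph statistics over a toy weighted directed acyclic graph. Everything here is an assumption made for illustration: the node names, the edge weights, the `topological_features` function, and the particular feature set are hypothetical and are not taken from the CodeCircuit repository or from the paper's actual feature definitions.

```python
from collections import defaultdict


def topological_features(edges):
    """Compute simple structural statistics of a weighted DAG.

    `edges` is a list of (source, target, weight) tuples. This is a
    hypothetical stand-in for a line-level attribution graph, not the
    representation used by the paper.
    """
    nodes = set()
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for src, dst, _weight in edges:
        nodes.update((src, dst))
        out_edges[src].append(dst)
        in_degree[dst] += 1

    # Kahn's algorithm: process nodes in topological order and track the
    # longest path (in edges) that ends at each node.
    depth = {n: 0 for n in nodes}
    remaining = {n: in_degree[n] for n in nodes}
    frontier = [n for n in nodes if remaining[n] == 0]
    visited = 0
    while frontier:
        node = frontier.pop()
        visited += 1
        for nxt in out_edges[node]:
            depth[nxt] = max(depth[nxt], depth[node] + 1)
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                frontier.append(nxt)
    if visited != len(nodes):
        raise ValueError("graph contains a cycle")

    n, m = len(nodes), len(edges)
    return {
        "num_nodes": n,
        "num_edges": m,
        # Fraction of possible directed edges that are present.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "mean_abs_weight": sum(abs(w) for _, _, w in edges) / m if m else 0.0,
        "max_in_degree": max((in_degree[x] for x in nodes), default=0),
        "longest_path": max(depth.values(), default=0),
    }


# Toy graph: two input tokens feed two intermediate features and one logit.
toy_edges = [
    ("tok_a", "feat_1", 0.8),
    ("tok_b", "feat_1", 0.4),
    ("feat_1", "feat_2", -0.5),
    ("tok_b", "feat_2", 0.3),
    ("feat_2", "logit", 0.9),
]
features = topological_features(toy_edges)
```

In a setup like the one the abstract describes, a feature dictionary of this kind, computed per generated line, could serve as input to a downstream correctness classifier; which features are actually informative is an empirical question that this sketch does not address.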
