CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs

Abstract

Current paradigms for code verification rely heavily on external mechanisms, such as execution-based unit tests or auxiliary LLM judges, which are often labor-intensive or limited by the judging model's own capabilities. This raises a fundamental yet unexplored question: can the functional correctness of an LLM's generated code be assessed purely from the model's internal computational structure? Our primary objective is to investigate whether the model's neural dynamics encode internally decodable signals that are predictive of logical validity during code generation. Inspired by mechanistic interpretability, we propose to treat code verification as a mechanistic diagnostic task, mapping the model's explicit algorithmic trajectory into line-level attribution graphs. By decomposing complex residual flows, we aim to identify the structural signatures that distinguish sound reasoning from logical failure within the model's internal circuits. Analysis across Python, C++, and Java confirms that intrinsic correctness signals are robust across diverse syntaxes. Topological features from these internal graphs predict correctness more reliably than surface heuristics and enable targeted causal interventions to fix erroneous logic. These findings establish internal introspection as a decodable property for verifying generated code. Our code is at https://github.com/bruno686/CodeCircuit.
