Agentic Code Reasoning
March 2, 2026
Authors: Shubham Ugare, Satish Chandra
cs.AI
Abstract
Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench (Mohammad et al., 2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J (Just et al., 2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.
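The premises/trace/conclusion structure described above can be illustrated with a minimal prompt-template sketch. This is a hypothetical reconstruction, not the paper's actual template: the section names, wording, and the `build_prompt` helper are all illustrative assumptions about how such a structured prompt might be assembled for the patch-equivalence task.

```python
# Hypothetical sketch of a semi-formal reasoning prompt for patch
# equivalence verification. Section names (PREMISES / TRACE / CONCLUSION)
# mirror the abstract's description but are not taken from the paper.

SEMI_FORMAL_TEMPLATE = """\
You are judging whether two patches are semantically equivalent.
Do NOT execute any code. Structure your answer exactly as follows:

PREMISES:
- State explicit facts about the code: input domains, types, invariants.

TRACE:
- Enumerate the input cases and trace BOTH patches through each one.
- No case may be skipped; every case must end with a comparison of
  the two patches' observable behavior.

CONCLUSION:
- Answer EQUIVALENT or NOT-EQUIVALENT, justified only by the premises
  and traces above (no unsupported claims).

Patch A:
{patch_a}

Patch B:
{patch_b}
"""


def build_prompt(patch_a: str, patch_b: str) -> str:
    """Fill the semi-formal template with the two patches under comparison."""
    return SEMI_FORMAL_TEMPLATE.format(patch_a=patch_a, patch_b=patch_b)


if __name__ == "__main__":
    prompt = build_prompt(
        "def double(x): return x + x",
        "def double(x): return 2 * x",
    )
    print(prompt)
```

The required section headers are what give the output its certificate-like character: a downstream checker can reject any response that omits a section or leaves an input case untraced.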