Agentic Code Reasoning

Abstract

Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench (Mohammad et al., 2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J (Just et al., 2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.
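
To make the certificate idea concrete, the following is a minimal sketch of how a structured argument with premises, per-case traces, and a conclusion could be checked for skipped cases and unsupported claims. It is an illustration under stated assumptions, not the template used in this work: `Certificate`, `TraceStep`, `check_certificate`, and the example premises and test names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceStep:
    """One claim about execution, tied to the premises it relies on."""
    claim: str
    premise_ids: list[str]


@dataclass
class Certificate:
    """Hypothetical container for a semi-formal argument (not the paper's schema)."""
    premises: dict[str, str]            # premise id -> statement about the code
    traces: dict[str, list[TraceStep]]  # case name -> traced steps for that case
    conclusion: str


def check_certificate(cert: Certificate, required_cases: list[str]) -> list[str]:
    """Return structural problems; an empty list means the argument is well-formed."""
    problems = []
    # Every required case must have at least one traced step.
    for case in required_cases:
        if not cert.traces.get(case):
            problems.append(f"skipped case: {case}")
    # Every claim must cite at least one premise, and each cited premise must exist.
    for case, steps in cert.traces.items():
        for step in steps:
            if not step.premise_ids:
                problems.append(f"unsupported claim in {case}: {step.claim}")
            for pid in step.premise_ids:
                if pid not in cert.premises:
                    problems.append(f"unknown premise {pid} in {case}")
    if not cert.conclusion.strip():
        problems.append("missing conclusion")
    return problems


# Illustrative patch-equivalence argument with one skipped case and one bare claim.
cert = Certificate(
    premises={
        "P1": "patch A guards against None at the call site",
        "P2": "patch B guards against None inside the helper",
    },
    traces={
        "test_none_input": [TraceStep("both patches return early", ["P1", "P2"])],
        "test_valid_input": [TraceStep("both patches reach the same return", [])],
    },
    conclusion="equivalent on the listed tests",
)
problems = check_certificate(
    cert, ["test_none_input", "test_valid_input", "test_empty_input"]
)
for problem in problems:
    print(problem)
```

A check like this only verifies the shape of the argument (coverage of cases and citation of premises). Whether each premise is actually true of the code still depends on the agent's exploration of the codebase.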
