Efficient Agentic Reasoning via Self-Regulated Simulative Planning

Abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.
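The three-system decomposition described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the SR^2AM implementation: the number-line task, the function names (`world_model`, `configurator`, `simulate_plan`, `reactive_policy`, `run_episode`), and the hand-written horizon rule are all hypothetical stand-ins. In the paper the world model is an LLM and the configurator is learned; here both are replaced by simple deterministic functions so that the control flow is easy to follow.

```python
from itertools import product

# Toy task (an assumption for illustration): reach `goal` from `start`
# on the integers using two actions, "inc" (+1) and "double" (*2).
ACTIONS = ("inc", "double")


def world_model(state, action):
    """Stand-in for the world model: predicts the next state."""
    return state + 1 if action == "inc" else state * 2


def reactive_policy(state, goal):
    """System I: a cheap one-step rule with no lookahead."""
    return "double" if state * 2 <= goal else "inc"


def configurator(state, goal):
    """System III: returns a planning horizon; 0 means 'do not plan'.

    Hypothetical hand-written rule. Planning is skipped when doubling
    would overshoot (only "inc" is useful), and the horizon is deeper
    when the goal is far away.
    """
    if state * 2 > goal:
        return 0
    return 3 if goal - state > 8 else 2


def simulate_plan(state, goal, horizon):
    """System II: roll out every action sequence of length `horizon`
    through the world model and return the first action of the best one."""
    best_action, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            if s == goal:  # goal reached inside the simulation: stop
                break
            s = world_model(s, a)
        # Overshooting cannot be undone in this toy task, so it is invalid.
        cost = goal - s if s <= goal else float("inf")
        if cost < best_cost:
            best_action, best_cost = seq[0], cost
    return best_action


def run_episode(start, goal):
    """Run the loop and record (state, horizon, action) for each step."""
    if not 1 <= start <= goal:
        raise ValueError("this toy task assumes 1 <= start <= goal")
    state, trace = start, []
    while state != goal:
        horizon = configurator(state, goal)
        if horizon > 0:
            action = simulate_plan(state, goal, horizon)
        else:
            action = reactive_policy(state, goal)
        trace.append((state, horizon, action))
        state = world_model(state, action)
    return state, trace


if __name__ == "__main__":
    final, trace = run_episode(1, 37)
    planned = [h for _, h, _ in trace if h > 0]
    print("final state:", final)
    print("steps:", len(trace), "planned steps:", len(planned))
```

The recorded trace makes the two quantities discussed in the abstract concrete for this toy setting: planning frequency is the fraction of steps with a nonzero horizon, and average planning horizon is the mean horizon over those planned steps. A learned configurator would replace the fixed thresholds in `configurator`.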