Efficient Agentic Reasoning via Self-Regulated Simulative Planning

Abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought) and trains them end-to-end, expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems produce much longer reasoning chains, spending tokens inefficiently without reliable accuracy gains. We argue that efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II), which grounds deliberation in future-state prediction via a world model; self-regulation (System III), which decides when and how deeply to plan via a learned configurator; and reactive execution (System I), which handles fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self-Regulated Simulative Reasoning Agentic LLM), which realizes simulative reasoning and self-regulation as distinct stages within an LLM's chain-of-thought, with the LLM itself serving as the world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised learning followed by reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems, respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases the average planning horizon by 22.8% while planning frequency grows by only 2.0%, indicating that the model learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle that we expect to extend beyond planning to how agents govern their own learning and adaptation.
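The three-system decomposition can be illustrated with a toy control loop. The sketch below is not SR^2AM: it assumes a one-dimensional world with trap cells, an exact transition function standing in for the LLM world model, and a hand-written heuristic standing in for the learned configurator. All names (`world_model`, `configurator`, `simulate_plan`, `run_agent`) and the action set are hypothetical and chosen only for illustration.

```python
from itertools import product

ACTIONS = (1, 2)  # hypothetical action set: step one cell or jump two


def world_model(state, action):
    """Stand-in for the world model: predict the next state.
    In this toy the prediction is exact, so it doubles as the environment."""
    return state + action


def reactive_policy(state):
    """System I stand-in: a cheap reflex that always steps one cell forward."""
    return 1


def configurator(state, traps):
    """System III stand-in (hand-written, not learned): choose a planning
    horizon. Returns 0 when the reflex action is predicted to be safe."""
    if world_model(state, reactive_policy(state)) not in traps:
        return 0
    # Assumed heuristic: look further ahead when more traps are nearby.
    return sum(1 for s in range(state + 1, state + 5) if s in traps)


def simulate_plan(state, goal, traps, depth):
    """System II stand-in: roll out every action sequence of length `depth`
    through the world model and keep the safe one that ends nearest the goal."""
    best = None
    for seq in product(ACTIONS, repeat=depth):
        s, safe = state, True
        for a in seq:
            s = world_model(s, a)
            if s in traps:
                safe = False
                break
        if safe and (best is None or abs(goal - s) < best[0]):
            best = (abs(goal - s), seq)
    return list(best[1]) if best else []


def run_agent(start, goal, traps, max_steps=50):
    """Plan only when the configurator asks for it; otherwise act by reflex."""
    state, trace, plan_calls = start, [], 0
    while state < goal and len(trace) < max_steps:
        depth = configurator(state, traps)
        if depth > 0:
            plan = simulate_plan(state, goal, traps, depth)
            plan_calls += 1
            if not plan:  # no safe rollout found at this horizon
                break
            action = plan[0]
        else:
            action = reactive_policy(state)
        trace.append((state, depth, action))
        state = world_model(state, action)
    return state, trace, plan_calls


if __name__ == "__main__":
    final, trace, plan_calls = run_agent(start=0, goal=6, traps={3})
    print(final, plan_calls, len(trace))
```

With `start=0`, `goal=6`, and a single trap at cell 3, the agent takes five steps and invokes the planner on only one of them (at cell 2, where the reflex action would land on the trap). This mirrors, in miniature, the idea that self-regulation keeps planning cost proportional to need; how the actual system learns this decision is not specified in the abstract.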