Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought) and trains them end-to-end, expecting planning to emerge implicitly. Without explicit control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, which makes token use inefficient without reliably improving accuracy. We argue that efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II), which grounds deliberation in future-state prediction via a world model; self-regulation (System III), which decides when and how deeply to plan via a learned configurator; and reactive execution (System I), which handles fine-grained action. Simulative reasoning provides a unified planning mechanism across diverse tasks without per-domain engineering, while self-regulation ensures that the planner is invoked only when needed. To test this view, we develop SR^2AM (Self-Regulated Simulative Reasoning Agentic LLM), which realizes simulative reasoning and self-regulation as distinct stages within an LLM's chain-of-thought, with the LLM itself serving as the world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1), and reconstructing structured plans from the traces of pretrained reasoning LLMs (v1.0). Both are trained via supervised learning followed by reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with systems of 120B-355B and 685B-1T parameters, respectively, while v1.0-30B uses 25.8% to 95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases the average planning horizon by 22.8% while planning frequency grows by only 2.0%, indicating that the model learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle that we expect to extend beyond planning to how agents govern their own learning and adaptation.
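
To make the three-system decomposition concrete, the following is a minimal toy sketch of the control flow, not the SR^2AM implementation. Everything in it is a hypothetical stand-in: the "world model" is a hand-written transition function on integers rather than an LLM, the "configurator" is a fixed distance rule rather than a learned component, and the names `world_model`, `configurator`, `simulative_plan`, `reactive_policy`, and `run` are invented for illustration. It only shows how a self-regulation step can choose between reacting directly and simulating future states to a chosen depth.

```python
from itertools import product

# Toy setting (an assumption for illustration): the state is an integer,
# the goal is an integer, and each action adds a fixed offset to the state.
ACTIONS = (-1, 1, 3)


def world_model(state: int, action: int) -> int:
    """Stand-in world model: predicts the next state for an action."""
    return state + action


def reactive_policy(state: int, goal: int) -> int:
    """System I: pick the action whose predicted next state is closest to the goal."""
    return min(ACTIONS, key=lambda a: abs(world_model(state, a) - goal))


def configurator(state: int, goal: int, max_depth: int = 3) -> int:
    """System III: decide whether to plan and how deeply.

    Returns 0 to skip planning. A hand-written rule stands in here for
    the learned configurator described in the abstract.
    """
    distance = abs(goal - state)
    if distance <= 1:
        return 0  # close enough: react without planning
    return min(max_depth, distance)


def rollout_cost(state: int, goal: int, actions: tuple) -> int:
    """Simulate a sequence of actions and sum the distance to the goal at each step."""
    cost = 0
    for action in actions:
        state = world_model(state, action)
        cost += abs(state - goal)
    return cost


def simulative_plan(state: int, goal: int, depth: int) -> int:
    """System II: simulate every action sequence of length `depth` and
    return the first action of the lowest-cost sequence."""
    best = min(product(ACTIONS, repeat=depth),
               key=lambda seq: rollout_cost(state, goal, seq))
    return best[0]


def run(start: int, goal: int, max_steps: int = 20):
    """Run the agent loop; the trace records (state, planning depth, action)."""
    state, trace = start, []
    for _ in range(max_steps):
        if state == goal:
            break
        depth = configurator(state, goal)
        if depth == 0:
            action = reactive_policy(state, goal)
        else:
            action = simulative_plan(state, goal, depth)
        trace.append((state, depth, action))
        state = world_model(state, action)
    return state, trace


if __name__ == "__main__":
    final_state, trace = run(start=0, goal=5)
    print(final_state, trace)
```

In this toy run the recorded planning depth shrinks as the agent approaches the goal and drops to zero on the last step, which mirrors, in a very simplified way, the idea that the planner is invoked only when needed. The sketch says nothing about how such a configurator would be learned or how planning is expressed inside a chain-of-thought.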