Efficient Agentic Reasoning via Self-Regulated Simulative Planning

Abstract

When and how should an agent plan? The dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought) and trains them end-to-end, expecting planning to emerge implicitly. Because nothing controls the presence, structure, or horizon of planning, these systems greatly lengthen their reasoning and spend tokens inefficiently without reliable accuracy gains. We argue that efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II), which grounds deliberation in future-state prediction via a world model; self-regulation (System III), which decides when and how deeply to plan via a learned configurator; and reactive execution (System I), which handles fine-grained actions. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self-Regulated Simulative Reasoning Agentic LLM), which realizes simulative reasoning and self-regulation as distinct stages within an LLM's chain-of-thought, with the LLM serving as the world model. We explore two instantiations: v0.1, which records decisions from a prompted multi-module system, and v1.0, which reconstructs structured plans from the traces of pretrained reasoning LLMs. The models are trained with supervised learning followed by reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems, respectively, and v1.0-30B uses 25.8% to 95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases the average planning horizon by 22.8% while planning frequency grows by only 2.0%, indicating that the model learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle that we expect to extend beyond planning to how agents govern their own learning and adaptation.
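
The three-system decomposition described in the abstract can be illustrated with a toy control loop. The sketch below is not the SR^2AM implementation: the integer state space, the action set, the gap-based rule standing in for the learned configurator, and every function name are hypothetical choices made for illustration, and a hand-written transition function takes the place of the LLM world model.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical toy domain: the state is an integer and the goal is a target integer.
ACTIONS = {
    "inc": lambda s: s + 1,
    "dec": lambda s: s - 1,
    "double": lambda s: s * 2,
}


def world_model(state: int, action: str) -> int:
    """Predict the next state. In this toy it also serves as the environment."""
    return ACTIONS[action](state)


@dataclass
class PlanConfig:
    should_plan: bool
    horizon: int


def configurator(state: int, goal: int, max_horizon: int = 3) -> PlanConfig:
    """System III stand-in: decide whether to plan and how far ahead.

    A fixed rule replaces the learned configurator (assumption).
    """
    gap = abs(goal - state)
    if gap <= 1:
        return PlanConfig(False, 0)  # easy case: skip planning entirely
    return PlanConfig(True, min(max_horizon, gap))


def simulate_plan(state: int, goal: int, horizon: int) -> list:
    """System II stand-in: roll out action sequences in the world model
    and keep the first one whose predicted final state is closest to the goal."""
    best_seq, best_dist = (), abs(goal - state)
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for action in seq:
            s = world_model(s, action)
        if abs(goal - s) < best_dist:
            best_seq, best_dist = seq, abs(goal - s)
    return list(best_seq)


def reactive_action(state: int, goal: int) -> str:
    """System I stand-in: a cheap one-step policy with no lookahead."""
    return "inc" if state < goal else "dec"


def run_agent(state: int, goal: int, max_steps: int = 20):
    """Return (final_state, steps_taken, number_of_planning_calls)."""
    steps, plan_calls = 0, 0
    while state != goal and steps < max_steps:
        cfg = configurator(state, goal)
        plan = []
        if cfg.should_plan:
            plan = simulate_plan(state, goal, cfg.horizon)
            plan_calls += 1
        # Execute the plan if one was found, otherwise fall back to one reactive step.
        for action in plan or [reactive_action(state, goal)]:
            state = world_model(state, action)
            steps += 1
    return state, steps, plan_calls


if __name__ == "__main__":
    print(run_agent(1, 8))  # plans once with horizon 3
    print(run_agent(3, 4))  # gap of 1: no planning, one reactive step
```

In this toy, tracking the number of planning calls separately from the horizon of each call mirrors the distinction the abstract draws between planning frequency and planning horizon.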