

ProAct: Agentic Lookahead in Interactive Environments

February 5, 2026
作者: Yangbin Yu, Mingyu Yang, Junyou Li, Yiming Gao, Feiyu Liu, Yijun Yang, Zichuan Lin, Jiafei Lyu, Yicheng Liu, Zhicong Lu, Deheng Ye, Jie Jiang
cs.AI

Abstract

Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments in both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. Code and models are available at https://github.com/GreatX3/ProAct
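
The MC-Critic idea described above, calibrating value estimates with lightweight environment rollouts instead of a learned critic, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `ToyChainEnv` interface, the function name `mc_value_estimate`, and its parameters are hypothetical stand-ins, not the paper's actual API.

```python
import copy

class ToyChainEnv:
    """Tiny deterministic environment used only to exercise the sketch:
    the agent walks right along a chain and is rewarded at the end."""
    def __init__(self, length=5):
        self.pos, self.length = 0, length

    def observation(self):
        return self.pos

    def step(self, action):
        self.pos = min(self.pos + action, self.length)
        done = self.pos == self.length
        return self.pos, 1.0 if done else 0.0, done

def mc_value_estimate(env, policy, n_rollouts=8, horizon=20, gamma=0.99):
    """Average discounted return over a few lightweight simulator rollouts,
    used as a grounded baseline V(s) for a policy-gradient update.
    (Illustrative sketch; not the paper's actual MC-Critic code.)"""
    returns = []
    for _ in range(n_rollouts):
        sim = copy.deepcopy(env)          # branch a cheap copy of the state
        total, discount = 0.0, 1.0
        obs = sim.observation()
        for _ in range(horizon):
            obs, reward, done = sim.step(policy(obs))
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)

if __name__ == "__main__":
    env = ToyChainEnv()
    value = mc_value_estimate(env, policy=lambda obs: 1)
    print(f"MC value estimate at start state: {value:.3f}")  # ~0.99**4 = 0.961
```

In a PPO- or GRPO-style update, such a rollout-based estimate could stand in for a learned critic's V(s) when computing advantages, which is the kind of grounded, low-variance signal the abstract attributes to MC-Critic.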