Anticipatory Planning for Multimodal AI Agents

Abstract

Recent advances in multimodal agents have improved computer-use interaction and tool use, yet most existing systems remain reactive: they optimize each action in isolation, without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated on seven benchmarks, covering online and offline computer-use benchmarks and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
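
The abstract does not spell out the form of the trajectory-level reward. As a purely illustrative sketch, one way such a reward could be computed is a discounted agreement score between a predicted short-horizon action sequence and a reference sequence. Everything here is an assumption made for illustration and is not taken from TraceR1: the `Action` type, the partial-credit rule in `step_score`, the discount factor `gamma`, and the normalization.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; not the paper's action format."""
    kind: str    # e.g. "click" or "type" (assumed vocabulary)
    target: str  # e.g. a UI element name or a tool argument


def step_score(pred: Action, ref: Action) -> float:
    """Assumed scoring rule: 1.0 for an exact match, 0.5 if only the kind matches."""
    if pred == ref:
        return 1.0
    if pred.kind == ref.kind:
        return 0.5
    return 0.0


def trajectory_reward(
    pred: Sequence[Action], ref: Sequence[Action], gamma: float = 0.9
) -> float:
    """Discounted agreement over the reference horizon, normalized to [0, 1].

    Earlier steps weigh more (gamma ** t); reference steps with no
    corresponding prediction contribute zero.
    """
    if not ref:
        return 0.0
    total = 0.0
    norm = 0.0
    for t, ref_action in enumerate(ref):
        weight = gamma ** t
        norm += weight
        if t < len(pred):
            total += weight * step_score(pred[t], ref_action)
    return total / norm
```

Because the score is computed over the whole predicted sequence instead of a single step, a policy trained against it is rewarded for plans that stay consistent across the horizon, which is the property the first stage is described as enforcing. The second stage (execution feedback from frozen tool agents) would need a real environment and is not sketched here.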
