Anticipatory Planning for Multimodal AI Agents
March 17, 2026
Authors: Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A. Rossi, Ruiyi Zhang
cs.AI
Abstract
Recent advances in multimodal agents have improved computer-use interaction and tool use, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online and offline computer-use benchmarks and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
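The abstract's two-stage reward structure can be sketched in toy form: a trajectory-level reward that scores the whole forecast sequence at once (stage 1), and step-level rewards derived from a frozen executor's feedback (stage 2). Everything below is a hedged illustration under assumptions; the function names, the stub executor, and the exact reward shapes are not from the paper.

```python
def trajectory_reward(predicted, reference):
    """Stage 1 sketch: a trajectory-level reward that is granted only if
    the entire forecast action sequence is consistent with the reference,
    rather than crediting each step independently."""
    return 1.0 if predicted == reference else 0.0

def grounded_step_rewards(predicted, executor):
    """Stage 2 sketch: step-level rewards from execution feedback of a
    frozen tool agent (here a plain callable returning success/failure),
    refining the executability of individual actions."""
    rewards = []
    for action in predicted:
        ok = executor(action)      # feedback from the frozen executor
        rewards.append(1.0 if ok else 0.0)
        if not ok:                 # stop scoring at the first failed action
            break
    return rewards

# Toy usage with a hypothetical computer-use action vocabulary.
reference = ["open_app", "click_search", "type_query"]
executor = lambda a: a in reference    # stub: known actions "execute" fine

print(trajectory_reward(["open_app", "click_search", "type_query"], reference))
print(grounded_step_rewards(["open_app", "scroll"], executor))
```

In this toy form the stage-1 signal is sparse but globally consistent, while the stage-2 signal is dense but local; the abstract's claim is that training with both in sequence yields plans that are coherent end to end and executable step by step.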