Anticipatory Planning for Multimodal AI Agents

Abstract

Recent advances in multimodal agents have improved computer-use interaction and tool use, yet most existing systems remain reactive: they optimize each action in isolation, without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated on seven benchmarks spanning online computer use, offline computer use, and multimodal tool-use reasoning, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
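
To make the two reward signals concrete, the sketch below shows one way a trajectory-level reward and a grounded step-level reward could be written. It is an illustration only, not the reward used by TraceR1: the abstract does not specify the reward functions, so the names `trajectory_reward` and `grounded_step_reward`, the discount `gamma`, the exact-match comparison, and the equal weighting of executability and correctness are all assumptions.

```python
from typing import Sequence


def trajectory_reward(
    predicted: Sequence[str],
    reference: Sequence[str],
    gamma: float = 0.9,
) -> float:
    """Hypothetical stage-one reward: discounted agreement between a
    predicted short-horizon action sequence and a reference sequence.

    Returns a value in [0, 1]; earlier steps carry more weight.
    """
    if not reference:
        return 0.0
    # Assumed discounting: step t is weighted by gamma ** t.
    weights = [gamma ** t for t in range(len(reference))]
    matched = sum(
        w
        for t, w in enumerate(weights)
        if t < len(predicted) and predicted[t] == reference[t]
    )
    return matched / sum(weights)


def grounded_step_reward(executable: bool, correct: bool) -> float:
    """Hypothetical stage-two reward: combines whether a frozen tool agent
    could execute the step with whether the result was correct.

    The equal 0.5 / 0.5 weighting is an assumption for illustration.
    """
    return 0.5 * float(executable) + 0.5 * float(correct)
```

Under these assumptions, a prediction that gets the first step right but diverges afterwards still earns partial credit, while the step-level signal separately rewards actions that the tool agent can actually carry out.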
