ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as clicking and typing, and high-level tool calls, such as API-based file operations. This hybrid action space often leaves them uncertain about whether to continue with GUI actions or switch to tools, which leads to suboptimal execution paths. The difficulty stems from three factors: the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, yielding diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, which combines warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over the GUI-only setting, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. The project is open-sourced at: https://x-plug.github.io/ToolCUA/
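To make the idea of a reward that favors appropriate tool use and shorter execution paths more concrete, the following is a minimal sketch only. The abstract does not give the actual form of the Tool-Efficient Path Reward, so the `Step` schema, the `path_reward` function, the `tool_appropriate` label, and every weight below are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in a trajectory (hypothetical schema for illustration)."""
    kind: str                      # "gui" or "tool"
    tool_appropriate: bool = True  # assumed label: was a tool call suitable here?


def path_reward(steps, success, max_steps=30, length_weight=0.3, misuse_weight=0.1):
    """Toy trajectory-level reward.

    Combines task success, a bonus for shorter paths, and a penalty for
    tool calls flagged as inappropriate. The functional form and all
    weights are assumptions, not the paper's definition.
    """
    if not success:
        return 0.0  # assumption: failed episodes earn no reward
    n = min(len(steps), max_steps)
    # Shorter trajectories receive a larger bonus, up to length_weight.
    length_bonus = length_weight * (1.0 - n / max_steps)
    # Count tool calls that were labeled as unsuitable for their context.
    misuse = sum(1 for s in steps if s.kind == "tool" and not s.tool_appropriate)
    return max(0.0, 1.0 + length_bonus - misuse_weight * misuse)


# A 12-step GUI-only path versus a 3-step path that uses one tool call.
gui_only = [Step("gui") for _ in range(12)]
hybrid = [Step("gui"), Step("tool"), Step("gui")]
print(path_reward(gui_only, success=True))
print(path_reward(hybrid, success=True))
```

Under these assumptions, a successful trajectory that reaches the goal in fewer steps through a suitable tool call scores higher than a longer GUI-only trajectory, while an unsuitable tool call lowers the score.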
