ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations. This hybrid action space, however, often leaves them uncertain whether to continue with GUI actions or switch to tools, leading to suboptimal execution paths. The difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. ToolCUA also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-source release: https://x-plug.github.io/ToolCUA/
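The abstract says only that the Tool-Efficient Path Reward encourages appropriate tool use and shorter execution paths; it does not give a formula. As a purely illustrative sketch of how a reward with those two properties could be shaped, the Python below uses hypothetical names (`Step`, `path_reward`), a hypothetical per-step `tool_appropriate` label, and arbitrary weights. None of these come from the paper, and the authors' actual reward may differ substantially.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One action in a trajectory (hypothetical structure, not from the paper)."""
    kind: str                      # "gui" or "tool"
    tool_appropriate: bool = True  # assumed label: was a tool call suitable here?


def path_reward(
    steps: List[Step],
    success: bool,
    max_steps: int = 30,         # assumed step budget
    length_weight: float = 0.2,  # assumed weight for the short-path bonus
    tool_weight: float = 0.1,    # assumed weight for the tool-use bonus
) -> float:
    """Illustrative reward: success first, then bonuses for brevity and apt tool use."""
    if not success:
        # Failed episodes get nothing, so the bonuses cannot outweigh task success.
        return 0.0

    # Shorter paths earn a larger bonus; paths at or beyond the budget earn none.
    length_bonus = length_weight * max(0.0, 1.0 - len(steps) / max_steps)

    # Reward the fraction of tool calls that were labeled appropriate.
    tool_steps = [s for s in steps if s.kind == "tool"]
    if tool_steps:
        apt_fraction = sum(s.tool_appropriate for s in tool_steps) / len(tool_steps)
        tool_bonus = tool_weight * apt_fraction
    else:
        tool_bonus = 0.0

    return 1.0 + length_bonus + tool_bonus
```

Under these assumptions, a successful four-step path containing one appropriate tool call scores higher than a successful ten-step GUI-only path, while any failed path scores zero regardless of length or tool use.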
