ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations. This hybrid action space often leaves them uncertain whether to continue with GUI actions or switch to tools, which leads to suboptimal execution paths. The difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent that learns optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, producing diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT (reinforcement fine-tuning), which combines warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over the GUI-only setting, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-source release: https://x-plug.github.io/ToolCUA/
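
The abstract names a Tool-Efficient Path Reward but does not give its form. As a rough illustration only, the sketch below shows one way a reward could favor successful episodes that are short and whose tool calls execute cleanly. The `Step` type, the `path_reward` function, the weights, and the step budget are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str               # "gui" or "tool" (hypothetical labels)
    succeeded: bool = True  # whether the action executed without error


def path_reward(steps: List[Step], task_success: bool, max_steps: int = 30,
                length_weight: float = 0.2, tool_weight: float = 0.1) -> float:
    """Hypothetical shaped reward: success, plus bonuses for short paths
    and for tool calls that executed without error."""
    if not task_success:
        return 0.0  # assumption: failed episodes receive no shaping bonus
    n = min(len(steps), max_steps)
    # Shorter trajectories earn a larger share of the length bonus.
    length_bonus = length_weight * (1.0 - n / max_steps)
    tool_calls = [s for s in steps if s.kind == "tool"]
    if tool_calls:
        # Fraction of tool calls that ran cleanly scales the tool bonus.
        valid = sum(s.succeeded for s in tool_calls) / len(tool_calls)
        tool_bonus = tool_weight * valid
    else:
        tool_bonus = 0.0
    return 1.0 + length_bonus + tool_bonus


# A 10-step GUI-only path versus a 3-step path that uses one tool call.
gui_only = [Step("gui") for _ in range(10)]
hybrid = [Step("tool")] + [Step("gui") for _ in range(2)]
print(path_reward(hybrid, True) > path_reward(gui_only, True))
```

Under these assumed weights, the shorter hybrid path scores higher than the longer GUI-only path when both succeed, which is the qualitative behavior the abstract describes. The actual reward used to train ToolCUA may differ in both structure and scale.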
