UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
October 20, 2025
Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Multimodal agents for computer use rely exclusively on primitive actions
(click, type, scroll) that require accurate visual grounding and lengthy
execution chains, leading to cascading failures and performance bottlenecks.
While other agents leverage rich programmatic interfaces (APIs, MCP servers,
tools), computer-use agents (CUAs) remain isolated from these capabilities. We
present UltraCUA, a foundation model that bridges this gap through hybrid
action -- seamlessly integrating GUI primitives with high-level programmatic
tool calls. To achieve this, our approach comprises four key components: (1) an
automated pipeline that scales programmatic tools from software documentation,
open-source repositories, and code generation; (2) a synthetic data engine
producing over 17,000 verifiable tasks spanning real-world computer-use
scenarios; (3) a large-scale high-quality hybrid action trajectory collection
with both low-level GUI actions and high-level programmatic tool calls; and (4)
a two-stage training pipeline combining supervised fine-tuning with online
reinforcement learning, enabling strategic alternation between low-level and
high-level actions. Experiments with our 7B and 32B models demonstrate
substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA
models achieve an average 22% relative improvement over base models, while
being 11% faster in terms of steps. Out-of-domain evaluation on
WindowsAgentArena shows our model reaches a 21.7% success rate, outperforming
baselines trained on Windows data. The hybrid action mechanism proves critical,
reducing error propagation while maintaining execution efficiency.
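
To make the idea of hybrid action concrete, the following is a minimal sketch of an action space that mixes GUI primitives with programmatic tool calls behind a single step interface. It is not the paper's implementation: the class names (`GuiAction`, `ToolCall`, `HybridExecutor`) and the `set_cell` tool are hypothetical and chosen only for illustration, and the GUI branch merely records the action where a real agent would drive the operating system's input layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class GuiAction:
    """A low-level primitive such as click, type, or scroll (hypothetical type)."""
    kind: str
    args: dict = field(default_factory=dict)


@dataclass
class ToolCall:
    """A high-level programmatic call, looked up by name (hypothetical type)."""
    name: str
    kwargs: dict = field(default_factory=dict)


Action = Union[GuiAction, ToolCall]


class HybridExecutor:
    """Routes each action to a registered tool or to the GUI layer."""

    def __init__(self, tools: Dict[str, Callable[..., str]]):
        self.tools = tools
        self.log: List[str] = []

    def step(self, action: Action) -> str:
        if isinstance(action, ToolCall):
            if action.name not in self.tools:
                raise KeyError(f"unknown tool: {action.name}")
            result = self.tools[action.name](**action.kwargs)
            self.log.append(f"tool:{action.name}")
            return result
        # Assumption: a real agent would forward this to the OS input layer.
        self.log.append(f"gui:{action.kind}")
        return f"performed {action.kind} with {action.args}"


def set_cell(cell: str, value: str) -> str:
    """Hypothetical spreadsheet tool that replaces several clicks and keystrokes."""
    return f"{cell}={value}"


executor = HybridExecutor({"set_cell": set_cell})
trajectory = [
    GuiAction("click", {"x": 120, "y": 48}),
    ToolCall("set_cell", {"cell": "A1", "value": "42"}),
]
outputs = [executor.step(a) for a in trajectory]
```

In this sketch a single tool call stands in for what would otherwise be a chain of visually grounded clicks and keystrokes, which is the intuition behind the claim that hybrid action shortens trajectories and limits error propagation.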