UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
October 20, 2025
Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Multimodal agents for computer use rely exclusively on primitive actions
(click, type, scroll) that require accurate visual grounding and lengthy
execution chains, leading to cascading failures and performance bottlenecks.
While other agents leverage rich programmatic interfaces (APIs, MCP servers,
tools), computer-use agents (CUAs) remain isolated from these capabilities. We
present UltraCUA, a foundation model that bridges this gap through hybrid
action, which seamlessly integrates GUI primitives with high-level programmatic
tool calls. To achieve this, our approach comprises four key components: (1) an
automated pipeline that scales programmatic tools from software documentation,
open-source repositories, and code generation; (2) a synthetic data engine
producing over 17,000 verifiable tasks spanning real-world computer-use
scenarios; (3) a large-scale, high-quality collection of hybrid action
trajectories containing both low-level GUI actions and high-level programmatic
tool calls; and (4)
a two-stage training pipeline combining supervised fine-tuning with online
reinforcement learning, enabling strategic alternation between low-level and
high-level actions. Experiments with our 7B and 32B models demonstrate
substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA
models achieve an average 22% relative improvement over base models, while
being 11% faster in terms of steps. Out-of-domain evaluation on
WindowsAgentArena shows that our model reaches a 21.7% success rate, outperforming
baselines trained on Windows data. The hybrid action mechanism proves critical,
reducing error propagation while maintaining execution efficiency.
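
To make the idea of a hybrid action space concrete, the following is a minimal sketch, not the UltraCUA implementation. It assumes an agent emits either a GUI primitive or a programmatic tool call at each step, and that a single executor dispatches on the action type. All names here (`GuiAction`, `ToolCall`, `HybridExecutor`, and the `set_cell` tool) are hypothetical and chosen only for illustration; the GUI branch is a stub that records the action rather than driving a real desktop.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union

# Primitive GUI action kinds named in the abstract.
GUI_KINDS = {"click", "type", "scroll"}


@dataclass
class GuiAction:
    """A low-level GUI primitive (hypothetical representation)."""
    kind: str
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolCall:
    """A high-level programmatic tool call (hypothetical representation)."""
    name: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


Action = Union[GuiAction, ToolCall]


class HybridExecutor:
    """Dispatches each action to a tool function or to a GUI stub."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self.tools = dict(tools)
        self.log: List[str] = []

    def step(self, action: Action) -> Any:
        if isinstance(action, ToolCall):
            if action.name not in self.tools:
                raise KeyError(f"unknown tool: {action.name}")
            result = self.tools[action.name](**action.kwargs)
            self.log.append(f"tool:{action.name}")
            return result
        if isinstance(action, GuiAction):
            if action.kind not in GUI_KINDS:
                raise ValueError(f"unknown GUI primitive: {action.kind}")
            # A real agent would drive the desktop here; this stub only logs.
            self.log.append(f"gui:{action.kind}")
            return None
        raise TypeError(f"unsupported action type: {type(action).__name__}")


def set_cell(sheet: str, cell: str, value: Any) -> str:
    """Hypothetical spreadsheet tool that replaces a multi-click GUI sequence."""
    return f"{sheet}!{cell}={value}"


executor = HybridExecutor({"set_cell": set_cell})
gui_result = executor.step(GuiAction("click", {"x": 10, "y": 20}))
tool_result = executor.step(ToolCall("set_cell", {"sheet": "Sheet1", "cell": "A1", "value": 42}))
```

In this sketch a single tool call stands in for what would otherwise be several click and type primitives, which is the kind of shortening of execution chains the abstract describes. How the model chooses between the two action types is learned during training and is not modeled here.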