Grounding Computer Use Agents on Human Demonstrations
November 10, 2025
Authors: Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar
cs.AI
Abstract
Building reliable computer-use agents requires grounding: accurately
connecting natural language instructions to the correct on-screen elements.
While large datasets exist for web and mobile interactions, high-quality
resources for desktop environments are limited. To address this gap, we
introduce GroundCUA, a large-scale desktop grounding dataset built from expert
human demonstrations. It covers 87 applications across 12 categories and
includes 56K screenshots, with every on-screen element carefully annotated for
a total of over 3.56M human-verified annotations. From these demonstrations, we
generate diverse instructions that capture a wide range of real-world tasks,
providing high-quality data for model training. Using GroundCUA, we develop the
GroundNext family of models that map instructions to their target UI elements.
At both 3B and 7B scales, GroundNext achieves state-of-the-art results across
five benchmarks using supervised fine-tuning, while requiring less than
one-tenth the training data of prior work. Reinforcement learning post-training
further improves performance, and when evaluated in an agentic setting on the
OSWorld benchmark using o3 as planner, GroundNext attains comparable or
superior results to models trained with substantially more data. These results
demonstrate the critical role of high-quality, expert-driven datasets in
advancing general-purpose computer-use agents.
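
To make the grounding task concrete, the following is a minimal sketch of how a single grounding prediction could be checked. It assumes that each example pairs an instruction with a target bounding box in pixel coordinates and that the model outputs a click point. The names GroundingExample, click_hits_target, and grounding_accuracy are hypothetical and chosen for illustration only; the abstract does not specify the GroundCUA data format or the scoring rule used by the benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingExample:
    """Hypothetical record: an instruction and the box of its target element."""
    instruction: str
    bbox: tuple[int, int, int, int]  # assumed layout: (left, top, right, bottom) in pixels


def click_hits_target(example: GroundingExample, x: int, y: int) -> bool:
    """Return True if the predicted click (x, y) lies inside the target box."""
    left, top, right, bottom = example.bbox
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(
    examples: list[GroundingExample],
    predictions: list[tuple[int, int]],
) -> float:
    """Fraction of examples whose predicted click lands inside the target box."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        return 0.0
    hits = sum(
        click_hits_target(example, x, y)
        for example, (x, y) in zip(examples, predictions)
    )
    return hits / len(examples)


if __name__ == "__main__":
    examples = [
        GroundingExample("Click the Save button", (100, 40, 160, 70)),
        GroundingExample("Open the File menu", (0, 0, 50, 20)),
    ]
    predictions = [(130, 55), (200, 300)]  # first click hits, second misses
    print(grounding_accuracy(examples, predictions))
```

Scoring a click by whether it falls inside the target box is a common convention in GUI grounding evaluation, but it is used here as an assumption rather than as the paper's stated protocol.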