Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

May 19, 2025
Authors: Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
cs.AI

Abstract

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer use grounding dataset, which contains 4 million examples synthesized through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving performance on OSWorld from 5% to 27%. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmarks, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
