Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
May 19, 2025
Authors: Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
cs.AI
Abstract
Graphical user interface (GUI) grounding, the ability to map natural language
instructions to specific actions on graphical user interfaces, remains a
critical bottleneck in computer use agent development. Current benchmarks
oversimplify grounding tasks as short referring expressions, failing to capture
the complexity of real-world interactions that require software commonsense,
layout understanding, and fine-grained manipulation capabilities. To address
these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising
564 finely annotated samples across diverse task types including text matching,
element recognition, layout understanding, and precise manipulation.
Additionally, we synthesize and release the largest computer use grounding
dataset Jedi, which contains 4 million examples through multi-perspective
decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its
effectiveness by outperforming existing approaches on ScreenSpot-v2,
ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved
grounding with Jedi directly enhances agentic capabilities of general
foundation models on complex computer tasks, improving from 5% to 27% on
OSWorld. Through detailed ablation studies, we identify key factors
contributing to grounding performance and verify that combining specialized
data for different interface elements enables compositional generalization to
novel interfaces. All benchmarks, data, checkpoints, and code are open-sourced
and available at https://osworld-grounding.github.io.