GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
November 6, 2025
Authors: Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI
Abstract
We introduce GUI-360°, a large-scale, comprehensive dataset and
benchmark suite designed to advance computer-using agents (CUAs). CUAs present
unique challenges, and progress on them is constrained by three persistent gaps: a scarcity of
real-world CUA tasks, the lack of automated collection-and-annotation pipelines
for multi-modal trajectories, and the absence of a unified benchmark that
jointly evaluates GUI grounding, screen parsing, and action prediction.
GUI-360° addresses these gaps with an LLM-augmented, largely automated
pipeline for query sourcing, environment-template construction, task
instantiation, batched execution, and LLM-driven quality filtering. The
released corpus contains over 1.2M executed action steps across thousands of
trajectories in popular Windows office applications, and includes
full-resolution screenshots, accessibility metadata when available,
instantiated goals, intermediate reasoning traces, and both successful and
failed action trajectories. The dataset supports three canonical tasks (GUI
grounding, screen parsing, and action prediction) and a hybrid GUI+API action
space that reflects modern agent designs (illustrated below). Benchmarking state-of-the-art
vision-language models on GUI-360° reveals substantial out-of-the-box
shortcomings in grounding and action prediction; supervised fine-tuning and
reinforcement learning yield significant gains but do not close the gap to
human-level reliability. We release GUI-360° and accompanying code to
facilitate reproducible research and accelerate progress on robust desktop
CUAs.
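
To make the hybrid GUI+API action space concrete, the sketch below shows what a single step record might look like. This is a minimal, hypothetical Python illustration; the class and field names (`HybridAction`, `gui_op`, `api_name`, and so on) are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HybridAction:
    """Illustrative step record for a hybrid GUI+API action space.

    A step is either a GUI operation grounded to an on-screen element
    (e.g., a click at screenshot coordinates) or a direct application
    API call that bypasses the GUI (e.g., a COM method in an Office
    app). All names here are hypothetical, for illustration only.
    """
    action_type: str                  # "gui" or "api"
    # GUI branch: grounded screen interaction
    gui_op: Optional[str] = None      # e.g., "click", "type", "scroll"
    target_bbox: Optional[tuple] = None  # (x1, y1, x2, y2) on the screenshot
    text_input: Optional[str] = None
    # API branch: direct application call
    api_name: Optional[str] = None    # e.g., "set_cell_value"
    api_args: dict = field(default_factory=dict)

# The same intent expressed as a GUI step and as an API step:
click_b2 = HybridAction("gui", gui_op="click", target_bbox=(412, 96, 460, 118))
set_b2 = HybridAction("api", api_name="set_cell_value",
                      api_args={"cell": "B2", "value": 42})
```

Keeping both branches in one record lets a single policy choose between pixel-grounded GUI actions and higher-level API calls, which is the kind of design the abstract's "modern agent designs" phrase points at.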
The full dataset is publicly available at
https://huggingface.co/datasets/vyokky/GUI-360.
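
Since the corpus is hosted on the Hugging Face Hub, it should be loadable with the standard `datasets` library. The snippet below is a sketch under that assumption; the configuration and split names ("train" here) are guesses, so check the dataset card for the actual layout.

```python
# Sketch: pull GUI-360 from the Hugging Face Hub.
# Assumes the default configuration and a "train" split; the real
# configs, splits, and field names are listed on the dataset card.
from datasets import load_dataset

ds = load_dataset("vyokky/GUI-360", split="train")
print(ds)            # row count and schema
print(ds[0].keys())  # field names of the first action step
```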