GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

November 6, 2025
Authors: Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI

Abstract

We introduce GUI-360°, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUA research presents unique challenges and is constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360° addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360° reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360° and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset is publicly available at https://huggingface.co/datasets/vyokky/GUI-360.
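Since the dataset is hosted on the Hugging Face Hub, a minimal sketch of streaming it with the standard `datasets` library is shown below. The split name and the shape of individual records are assumptions based on the abstract, not the dataset's documented schema.

```python
# Minimal sketch: stream GUI-360 from the Hugging Face Hub.
# Assumes the standard `datasets` library; the "train" split name is an
# assumption, and the record fields are discovered at runtime rather
# than taken from any documented schema.
from datasets import load_dataset

ds = load_dataset("vyokky/GUI-360", split="train", streaming=True)

# Inspect the first record to see which fields are present (e.g., the
# screenshot, accessibility metadata, instantiated goal, reasoning
# trace, and executed action described in the abstract).
first = next(iter(ds))
print(sorted(first.keys()))
```

Streaming avoids downloading the full corpus, which at over 1.2M executed steps with full-resolution screenshots is substantial, before inspecting its structure.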