Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

May 12, 2026
Authors: Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
cs.AI

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions remains poor, limiting user trust. Our analysis of failure cases from advanced models reveals a long-tail pattern in GUI operations: a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address it, we propose CUActSpot, a new benchmark that evaluates models' capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B achieves the best performance among open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git.
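The abstract names three pipeline steps: automatic scene generation per modality, rendering with ground-truth coordinate recording, and LLM-generated instructions with action traces. Below is a minimal Python sketch of such a loop under stated assumptions: the renderer and LLM interfaces (random_scene, render, element_boxes, propose_instruction_and_trace) are hypothetical names for illustration, not the released Phi-Ground code.

import json
import random

# The five modalities named in the paper.
MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]

def synthesize_sample(renderer, llm, modality: str) -> dict:
    """Produce one training sample: screenshot + instruction + action trace.

    `renderer` and `llm` are assumed interfaces, not the authors' actual API.
    """
    # 1. Automatically generate a scene for the chosen modality.
    scene = renderer.random_scene(modality)

    # 2. Render it, recording the screenshot and element coordinates.
    screenshot = renderer.render(scene)        # e.g. PNG bytes
    elements = renderer.element_boxes(scene)   # {element_id: (x0, y0, x1, y1)}

    # 3. Ask an LLM for a matching instruction and action trace
    #    (click/drag/draw, ...), grounded in the known coordinates.
    prompt = (
        "Given these on-screen elements and their bounding boxes, write a "
        "user instruction and the action trace that fulfills it:\n"
        + json.dumps(elements)
    )
    instruction, actions = llm.propose_instruction_and_trace(prompt)

    return {
        "modality": modality,
        "screenshot": screenshot,
        "elements": elements,
        "instruction": instruction,
        "actions": actions,  # e.g. [{"type": "drag", "from": ..., "to": ...}]
    }

# Example usage (hypothetical): sample modalities uniformly so complex,
# low-frequency interaction types are covered rather than the click-heavy head.
# corpus = [synthesize_sample(renderer, llm, random.choice(MODALITIES))
#           for _ in range(100_000)]

Because the renderer knows every element's exact coordinates, the synthesized traces come with ground truth for free, which is what lets the pipeline target the long tail of interactions that are rare in scraped human data.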