Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

May 12, 2026
Authors: Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
cs.AI

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions remains poor, limiting user trust. Our analysis of failure cases from advanced models reveals a long-tail pattern in GUI operations: a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address it, we propose CUActSpot, a new benchmark that evaluates models' capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B achieves the best performance among open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git.
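The abstract names three pipeline steps: automatic scene generation per modality, rendering with ground-truth coordinate recording, and LLM-generated instructions with action traces. Below is a minimal Python sketch of such a loop under stated assumptions: the renderer and LLM interfaces (random_scene, render, element_boxes, propose_instruction_and_trace) are hypothetical names for illustration, not the released Phi-Ground code.

import json
import random

# The five modalities named in the paper.
MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]

def synthesize_sample(renderer, llm, modality: str) -> dict:
    """Produce one training sample: screenshot + instruction + action trace.

    `renderer` and `llm` are assumed interfaces, not the authors' actual API.
    """
    # 1. Automatically generate a scene for the chosen modality.
    scene = renderer.random_scene(modality)

    # 2. Render it, recording the screenshot and element coordinates.
    screenshot = renderer.render(scene)        # e.g. PNG bytes
    elements = renderer.element_boxes(scene)   # {element_id: (x0, y0, x1, y1)}

    # 3. Ask an LLM for a matching instruction and action trace
    #    (click/drag/draw, ...), grounded in the known coordinates.
    prompt = (
        "Given these on-screen elements and their bounding boxes, write a "
        "user instruction and the action trace that fulfills it:\n"
        + json.dumps(elements)
    )
    instruction, actions = llm.propose_instruction_and_trace(prompt)

    return {
        "modality": modality,
        "screenshot": screenshot,
        "elements": elements,
        "instruction": instruction,
        "actions": actions,  # e.g. [{"type": "drag", "from": ..., "to": ...}]
    }

# Example usage (hypothetical): sample modalities uniformly so complex,
# low-frequency interaction types are covered rather than the click-heavy head.
# corpus = [synthesize_sample(renderer, llm, random.choice(MODALITIES))
#           for _ in range(100_000)]

Because the renderer knows every element's exact coordinates, the synthesized traces come with ground truth for free, which is what lets the pipeline target the long tail of interactions that are rare in scraped human data.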