Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
May 12, 2026
Authors: Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
cs.AI
Abstract
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions remains poor, limiting user trust. Our analysis of failure cases from advanced models reveals a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of training data for complex interactions. To address this problem, we propose CUActSpot, a new benchmark for evaluating models' capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms all open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git.
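The three steps of the synthesis pipeline (render a scene, record the screenshot and element coordinates, have an LLM write a matching instruction and action trace) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`render_scene`, `llm_annotate`, `synthesize`) are hypothetical, the renderer is a stub that only fabricates element bounding boxes, and the LLM call is replaced by a template that emits a single grounded click.

```python
import json
import random
from dataclasses import dataclass, field

# Hypothetical sketch of the abstract's renderer-based synthesis pipeline.
# A real pipeline would rasterize actual GUI/text/table/canvas/image scenes
# and query an LLM; here both are stubbed to show the data flow only.

@dataclass
class Element:
    kind: str                 # e.g. "button", "cell", "stroke"
    bbox: tuple               # (x0, y0, x1, y1) in screen pixels

@dataclass
class Sample:
    modality: str
    screenshot: str           # path the renderer would save the image to
    elements: list = field(default_factory=list)
    instruction: str = ""
    action_trace: list = field(default_factory=list)

MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]

def render_scene(modality: str, rng: random.Random) -> Sample:
    """Stub renderer: lay out a few elements and record their coordinates."""
    elements = []
    for i in range(rng.randint(2, 5)):
        x, y = rng.randint(0, 1800), rng.randint(0, 1000)
        elements.append(Element(kind=f"{modality}_elem_{i}",
                                bbox=(x, y, x + 80, y + 24)))
    return Sample(modality=modality,
                  screenshot=f"{modality}_{rng.randint(0, 9999)}.png",
                  elements=elements)

def llm_annotate(sample: Sample, rng: random.Random) -> Sample:
    """Stand-in for the LLM that writes an instruction and an action trace
    grounded in the recorded element coordinates (here: one click at the
    center of a randomly chosen element)."""
    target = rng.choice(sample.elements)
    cx = (target.bbox[0] + target.bbox[2]) // 2
    cy = (target.bbox[1] + target.bbox[3]) // 2
    sample.instruction = f"Click the {target.kind}."
    sample.action_trace = [{"action": "click", "x": cx, "y": cy}]
    return sample

def synthesize(n_per_modality: int, seed: int = 0) -> list:
    """Run the render -> record -> annotate loop over all five modalities."""
    rng = random.Random(seed)
    corpus = []
    for modality in MODALITIES:
        for _ in range(n_per_modality):
            corpus.append(llm_annotate(render_scene(modality, rng), rng))
    return corpus

if __name__ == "__main__":
    corpus = synthesize(n_per_modality=2)
    print(len(corpus))                         # 10 samples, 5 modalities x 2
    print(json.dumps(corpus[0].action_trace))  # one grounded click action
```

The key design point the abstract implies is that grounding comes for free: because the renderer knows every element's coordinates, the annotation step never has to guess where things are, which is what makes scaling to low-frequency interaction types (drag, draw, etc.) tractable.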