Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions remains poor, which limits user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations: a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address it, we propose a new benchmark, CUActSpot, for evaluating models' capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git.
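The renderer-based pipeline described above (generate a scene, record element coordinates, then pair them with an instruction and an action trace) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Element`, `make_scene`, and `make_sample` are hypothetical, no screenshot is actually rendered, and a fixed text template stands in for the LLM that writes the instructions.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """A hypothetical on-screen element with a known bounding box."""
    name: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels

    def center(self):
        # Integer center point; always inside the box because w, h > 1.
        return (self.x + self.w // 2, self.y + self.h // 2)


def make_scene(rng, n_items=4, width=800, height=600):
    """Place elements in separate horizontal slots so boxes never overlap.

    A real renderer would also draw the scene and save a screenshot;
    here only the ground-truth coordinates are recorded.
    """
    slot = width // n_items
    elements = []
    for i in range(n_items):
        w = rng.randint(40, slot - 20)
        h = rng.randint(30, 80)
        x = i * slot + rng.randint(0, slot - w)
        y = rng.randint(0, height - h)
        elements.append(Element(f"item_{i}", x, y, w, h))
    return {"width": width, "height": height, "elements": elements}


def make_sample(scene, rng):
    """Pair an instruction with an action trace derived from coordinates.

    Assumption: a template replaces the LLM-written instruction.
    """
    src, dst = rng.sample(scene["elements"], 2)
    if rng.random() < 0.5:
        instruction = f"Click {src.name}."
        trace = [{"action": "click", "point": src.center()}]
    else:
        instruction = f"Drag {src.name} onto {dst.name}."
        trace = [{"action": "drag", "start": src.center(), "end": dst.center()}]
    return {"instruction": instruction, "trace": trace}


if __name__ == "__main__":
    rng = random.Random(0)
    scene = make_scene(rng)
    print(make_sample(scene, rng))
```

Because the coordinates come from the scene generator rather than from human annotation, every action target is exact by construction, which is the property that makes this style of synthesis attractive for rare interaction types such as drags.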
