Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Abstract

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions and fail to capture the complexity of real-world interactions, which require software commonsense, layout understanding, and fine-grained manipulation. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark of 564 finely annotated samples spanning diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. We also synthesize and release Jedi, the largest computer-use grounding dataset, which contains 4 million examples produced through multi-perspective decoupling of tasks. Multi-scale models trained on Jedi outperform existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G. Furthermore, improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, raising performance on OSWorld from 5% to 27%. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmarks, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
