Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Abstract

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization is highly effective, yielding substantial improvements in most settings. For instance, multimodal mathematical reasoning improves performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and a 5.4% improvement on AndroidWorld, which demonstrates notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data, previously considered closely aligned with GUI agent tasks and widely used for training, has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity in this emerging field. The code, data, and models will be available at https://github.com/hkust-nlp/GUIMid.
