Grounding Computer Use Agents on Human Demonstrations
November 10, 2025
Authors: Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar
cs.AI
Abstract
Building reliable computer-use agents requires grounding: accurately
connecting natural language instructions to the correct on-screen elements.
While large datasets exist for web and mobile interactions, high-quality
resources for desktop environments are limited. To address this gap, we
introduce GroundCUA, a large-scale desktop grounding dataset built from expert
human demonstrations. It covers 87 applications across 12 categories and
includes 56K screenshots, with every on-screen element carefully annotated for
a total of over 3.56M human-verified annotations. From these demonstrations, we
generate diverse instructions that capture a wide range of real-world tasks,
providing high-quality data for model training. Using GroundCUA, we develop the
GroundNext family of models that map instructions to their target UI elements.
At both 3B and 7B scales, GroundNext achieves state-of-the-art results across
five benchmarks using supervised fine-tuning, while requiring less than
one-tenth the training data of prior work. Reinforcement learning post-training
further improves performance, and when evaluated in an agentic setting on the
OSWorld benchmark with o3 as the planner, GroundNext attains results comparable
or superior to those of models trained with substantially more data. These results
demonstrate the critical role of high-quality, expert-driven datasets in
advancing general-purpose computer-use agents.
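
To make the grounding task concrete, the following is a minimal sketch of how a predicted click location might be checked against an annotated target element. The abstract does not describe the dataset schema or the scoring rule, so everything here is an assumption for illustration: the `UIElement` fields, the point-in-box criterion, and the example coordinates are hypothetical and are not taken from GroundCUA or GroundNext.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UIElement:
    """One annotated on-screen element (hypothetical schema, not GroundCUA's format)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # True when the point lies inside the bounding box, edges included.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def is_grounded(target: UIElement, predicted_point: Tuple[int, int]) -> bool:
    """Assumed criterion: a prediction counts as correct if it lands inside the target box."""
    x, y = predicted_point
    return target.contains(x, y)


def grounding_accuracy(examples: List[Tuple[UIElement, Tuple[int, int]]]) -> float:
    """Fraction of (target, predicted point) pairs that satisfy the criterion above."""
    if not examples:
        return 0.0
    hits = sum(is_grounded(target, point) for target, point in examples)
    return hits / len(examples)


# Illustrative data only: a "Save" button and two made-up model predictions.
save_button = UIElement("Save", left=100, top=40, right=160, bottom=70)
examples = [
    (save_button, (130, 55)),  # falls inside the button
    (save_button, (300, 55)),  # falls to the right of the button
]
print(grounding_accuracy(examples))
```

In this sketch an instruction such as "click the Save button" would be paired with the `save_button` annotation, and a model's output would be the predicted point. Benchmarks may score grounding differently, so the criterion should be read as one plausible choice.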