

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

October 7, 2025
作者: Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
cs.AI

Abstract

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM, which achieves strong zero-shot generalization to unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve a 96.6% success rate on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all of our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/.
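As a rough, non-authoritative illustration of the pseudo-labeling step described in the abstract, the sketch below shows how an inverse dynamics model could fill in missing actions for unlabeled, timestamped gameplay frames. The DesktopEvent record, the idm.predict interface, and the context-window size are assumptions made for this example only; they are not the paper's actual OWA format or Generalist-IDM API.

    # Illustrative sketch (not the authors' code): pseudo-labeling desktop
    # gameplay with an inverse dynamics model (IDM) over timestamped events.
    # Event schema, model interface, and window size are assumptions.

    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class DesktopEvent:
        """A timestamped observation-action record in a unified desktop format."""
        timestamp_ms: int               # event time relative to recording start
        frame: bytes                    # encoded screen frame (e.g., compressed image)
        action: Optional[dict] = None   # keyboard/mouse event; None if unlabeled


    def pseudo_label(events: List[DesktopEvent], idm) -> List[DesktopEvent]:
        """Fill in missing actions by predicting them from surrounding frames.

        `idm` is assumed to expose `predict(frames, timestamps) -> dict`,
        returning a keyboard/mouse action for the center frame of the window.
        """
        window = 4  # frames of past/future context given to the IDM (assumed)
        labeled = []
        for i, ev in enumerate(events):
            if ev.action is None:
                ctx = events[max(0, i - window): i + window + 1]
                action = idm.predict(
                    frames=[e.frame for e in ctx],
                    timestamps=[e.timestamp_ms for e in ctx],
                )
                ev = DesktopEvent(ev.timestamp_ms, ev.frame, action)
            labeled.append(ev)
        return labeled

Running such a labeler over large volumes of unlabeled gameplay video is the sense in which timestamp-based event prediction enables internet-scale pseudo-labeling; the labeled trajectories can then feed vision-action pretraining.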