D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
October 7, 2025
Authors: Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
cs.AI
Abstract
Large language models leverage internet-scale text data, yet embodied AI
remains constrained by the prohibitive costs of physical trajectory collection.
Desktop environments -- particularly gaming -- offer a compelling alternative:
they provide rich sensorimotor interactions at scale while maintaining the
structured observation-action coupling essential for embodied learning. We
present D2E (Desktop to Embodied AI), a framework that demonstrates desktop
interactions can serve as an effective pretraining substrate for robotics
embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT
for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a
complete pipeline from scalable desktop data collection to verified transfer in
embodied domains. Our framework comprises three components: (1) the OWA Toolkit
that unifies diverse desktop interactions into a standardized format with 152x
compression, (2) the Generalist-IDM that achieves strong zero-shot
generalization across unseen games through timestamp-based event prediction,
enabling internet-scale pseudo-labeling, and (3) VAPT that transfers
desktop-pretrained representations to physical manipulation and navigation.
Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of
pseudo-labeled gameplay), we achieve a 96.6% overall success rate on LIBERO
manipulation and 83.3% on CANVAS navigation benchmarks. This validates that
sensorimotor primitives in digital interactions exhibit sufficient invariance
to transfer meaningfully to physical embodied tasks, establishing desktop
pretraining as a practical paradigm for robotics. We will make all of our work
public, including the OWA Toolkit, the human-collected and pseudo-labeled
datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/.