D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Abstract

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive cost of collecting physical trajectories. Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve a total success rate of 96.6% on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/
