ACE-Ego-0: Unifying Egocentric Human and Robot Data for VLA Pretraining

Abstract

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
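The abstract does not give the form of the reliability-aware objective, so the following is only a minimal sketch of one way such weighting could look. It assumes a per-sample reliability score in [0, 1] for human pseudo-actions and a scalar weight on the human auxiliary term. The function name, the parameters `reliability`, `is_human`, and `human_weight`, the mean-squared-error term, and the normalization by total weight are all hypothetical choices made for illustration, not the authors' formulation.

```python
from typing import Sequence


def reliability_weighted_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    reliability: Sequence[float],
    is_human: Sequence[bool],
    human_weight: float = 0.5,
) -> float:
    """Weighted mean of per-sample squared action errors.

    Robot samples get weight 1.0. Human pseudo-action samples get
    weight human_weight * reliability, so unreliable labels contribute
    little or nothing. (Illustrative assumption, not the paper's loss.)
    """
    total = 0.0  # sum of weight * error
    norm = 0.0   # sum of weights, used for normalization
    for p, t, r, h in zip(pred, target, reliability, is_human):
        # Mean squared error over the action dimensions of one sample.
        err = sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
        w = human_weight * r if h else 1.0
        total += w * err
        norm += w
    # Guard against a batch whose weights are all zero.
    return total / norm if norm > 0 else 0.0


# Example: one robot sample and one human sample with a fully
# unreliable pseudo-label; only the robot error contributes.
pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 1.0], [2.0, 2.0]]
loss = reliability_weighted_loss(pred, target, [1.0, 0.0], [False, True])
```

Under these assumptions, a human sample with reliability 0 is ignored entirely, while one with reliability 1 still counts for only `human_weight` of a robot sample, which keeps noisy pseudo-labels from dominating the batch.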