
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

June 15, 2026
Authors: Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li
cs.AI

Abstract

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
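
The abstract does not spell out the reliability-aware objective, so the following is only a minimal sketch of the general idea of down-weighting noisy pseudo-action labels, not the authors' formulation. The function name `reliability_weighted_loss`, the per-sample `reliability` scores in [0, 1], the `human_weight` scale, and the choice of mean squared error are all assumptions made for illustration.

```python
from typing import Sequence


def reliability_weighted_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    reliability: Sequence[float],
    is_human: Sequence[bool],
    human_weight: float = 0.5,
) -> float:
    """Weighted mean of per-sample MSE over a mixed human/robot batch.

    Hypothetical weighting rule (an assumption, not taken from the paper):
    robot samples get weight 1.0, while human pseudo-action samples get
    human_weight * reliability, so unreliable labels contribute less.
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, r, h in zip(pred, target, reliability, is_human):
        # Per-sample mean squared error over the action dimensions.
        mse = sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
        # Clamp reliability to [0, 1] before using it as a weight.
        w = human_weight * min(max(r, 0.0), 1.0) if h else 1.0
        total += w * mse
        weight_sum += w
    # Normalize by the total weight; an all-zero-weight batch yields 0.0.
    return total / weight_sum if weight_sum > 0 else 0.0


# One robot sample and one human sample with a fully trusted pseudo-label.
loss = reliability_weighted_loss(
    pred=[[0.0], [0.0]],
    target=[[1.0], [2.0]],
    reliability=[1.0, 1.0],
    is_human=[False, True],
)
```

In this sketch a human sample with reliability 0 drops out of the loss entirely, which is one simple way to concentrate supervision on reliable signals; a real system would more likely apply such weights per timestep or per action dimension.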