ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
June 15, 2026
Authors: Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li
cs.AI
Abstract
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-Ego-0, a unified VLA pretraining framework that jointly leverages heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-Ego-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-Ego-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-Ego-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
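The abstract does not give the form of the reliability-aware objective, so the following is only an illustrative sketch of the general idea of down-weighting noisy pseudo-action labels. It assumes a per-sample reliability score in [0, 1], a mean-squared action error, and a scalar `human_weight` on the human term; the function name, the parameters, and the weighting rule are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def reliability_weighted_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    reliability: Sequence[float],
    is_human: Sequence[bool],
    human_weight: float = 0.5,  # assumed scale of the human term
) -> float:
    """Weighted mean of per-sample mean-squared action errors.

    Robot samples get weight 1.0. Human pseudo-action samples get
    human_weight * reliability, with reliability clipped to [0, 1],
    so unreliable labels contribute little or nothing.
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, r, human in zip(pred, target, reliability, is_human):
        # Mean squared error over the action dimensions of one sample.
        err = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        # Hypothetical weighting rule: trust robot labels fully.
        w = human_weight * min(max(r, 0.0), 1.0) if human else 1.0
        total += w * err
        weight_sum += w
    # Normalize by total weight; return 0.0 if every sample was masked out.
    return total / weight_sum if weight_sum > 0 else 0.0
```

In this sketch, a human sample with reliability 0 is ignored entirely, while one with reliability 1 still counts for less than a robot sample whenever `human_weight` is below 1. Normalizing by the sum of weights keeps the loss scale comparable across batches with different mixes of human and robot data.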