
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

July 21, 2025
Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu
cs.AI

Abstract

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks that require high dexterity and generalize poorly to novel scenarios and tasks, primarily because they rely on synthetic data with significant sim-to-real gaps or on teleoperated demonstrations that lack scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, enabling precise modeling of hand trajectories for action learning. To support this paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, we observe the expected gains from Being-H0 in real-world robotic manipulation once physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.
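To make the part-level motion tokenization idea concrete, here is a minimal sketch of the general technique the abstract describes: discretizing a continuous hand trajectory by quantizing each hand part against its own codebook, yielding one token per part per frame. The part split, feature dimensions, and codebook size below are hypothetical placeholders, and the random codebooks stand in for trained ones; this is an illustration of the approach, not the paper's actual implementation.

```python
# Minimal sketch of part-level motion tokenization (vector quantization per
# hand part). All dimensions and the part split are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame hand pose: wrist (6-DoF) + five fingers (4 joints x 3D each).
PARTS = {"wrist": 6, "thumb": 12, "index": 12, "middle": 12, "ring": 12, "pinky": 12}
CODEBOOK_SIZE = 512

# One codebook per part; random entries here stand in for learned code vectors.
codebooks = {name: rng.normal(size=(CODEBOOK_SIZE, dim)) for name, dim in PARTS.items()}

def tokenize(frame: dict[str, np.ndarray]) -> dict[str, int]:
    """Map each part's pose vector to the index of its nearest codebook entry."""
    tokens = {}
    for name, vec in frame.items():
        dists = np.linalg.norm(codebooks[name] - vec, axis=1)
        tokens[name] = int(np.argmin(dists))
    return tokens

def detokenize(tokens: dict[str, int]) -> dict[str, np.ndarray]:
    """Reconstruct an approximate pose by looking up each part's code vector."""
    return {name: codebooks[name][idx] for name, idx in tokens.items()}

# Round-trip one random frame and report the mean per-part reconstruction error.
frame = {name: rng.normal(size=dim) for name, dim in PARTS.items()}
recon = detokenize(tokenize(frame))
err = np.mean([np.linalg.norm(frame[n] - recon[n]) for n in PARTS])
print(f"mean part reconstruction error: {err:.3f}")
```

In a trained system the codebooks would be learned (e.g., in a VQ-VAE-like setup) so that this round-trip error falls to the millimeter level the abstract reports, while the discrete tokens remain compatible with the language-model-style pretraining the paradigm relies on.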