Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

July 21, 2025
Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu
cs.AI

Abstract

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily because they rely on synthetic data with significant sim-to-real gaps or on teleoperated demonstrations that lack scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, enabling precise hand trajectories to be modeled for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, we observe the expected gains in real-world robotic manipulation once physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.
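The part-level motion tokenization described in the abstract can be pictured as a VQ-VAE-style discretizer with a separate codebook per hand part, turning continuous hand trajectories into discrete tokens a VLA can predict. The sketch below is a minimal illustration under stated assumptions, not Being-H0's published architecture: the class names (`PartVQ`, `PartLevelMotionTokenizer`), the six-part split, the MANO-style 51-dimensional input, and all dimensions are hypothetical.

```python
# Minimal sketch of part-level motion tokenization (illustrative only; not the
# published Being-H0 design). Assumes PyTorch and a MANO-style hand-pose vector;
# every name and dimension here is a hypothetical choice for the example.
import torch
import torch.nn as nn


class PartVQ(nn.Module):
    """Quantize one hand part's feature stream against its own codebook."""

    def __init__(self, codebook_size: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):  # z: (B, T, dim)
        # Nearest codebook entry per frame (squared Euclidean distance).
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dists.argmin(-1)              # (B, T) discrete motion tokens
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        return z + (z_q - z).detach(), idx


class PartLevelMotionTokenizer(nn.Module):
    """Encode a hand trajectory, split features by part, quantize each part."""

    def __init__(self, in_dim: int = 51, n_parts: int = 6, part_dim: int = 64):
        super().__init__()
        hid = n_parts * part_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.GELU(), nn.Linear(hid, hid))
        self.decoder = nn.Sequential(nn.Linear(hid, hid), nn.GELU(), nn.Linear(hid, in_dim))
        self.parts = nn.ModuleList(PartVQ(dim=part_dim) for _ in range(n_parts))
        self.part_dim = part_dim

    def forward(self, motion):  # motion: (B, T, in_dim) hand-pose trajectory
        z = self.encoder(motion)
        chunks = z.split(self.part_dim, dim=-1)          # one feature slice per part
        z_q, tokens = zip(*(vq(c) for vq, c in zip(self.parts, chunks)))
        recon = self.decoder(torch.cat(z_q, dim=-1))
        return recon, torch.stack(tokens, dim=-1)        # tokens: (B, T, n_parts)


motion = torch.randn(2, 16, 51)              # two 16-frame hand trajectories
tokenizer = PartLevelMotionTokenizer()
recon, tokens = tokenizer(motion)
recon_loss = (recon - motion).pow(2).mean()  # reconstruction term of a VQ objective
```

Splitting quantization by part keeps each codebook small while the cross-product of per-part tokens can still cover fine finger articulation, which is plausibly how millimeter-level reconstruction accuracy becomes tractable; the paper itself should be consulted for the actual tokenizer design.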