Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Abstract

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks that require high dexterity, and they generalize poorly to novel scenarios and tasks. This is primarily because they rely either on synthetic data, which suffers from a significant sim-to-real gap, or on teleoperated demonstrations, which lack scale and diversity. To address this data bottleneck, we propose treating the human hand as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We also introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, allowing precise hand trajectories to be modeled for action learning. To support this paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources (motion capture, VR, and RGB-only videos) into a large-scale dataset with millions of motion-based instructional instances. Empirically, Being-H0 excels at hand motion generation and instruction following, and it scales well with both model size and data size. Importantly, we observe the expected gains in real-world robotic manipulation once physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.

