Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Abstract

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks that require high dexterity and generalize poorly to novel scenarios and tasks, primarily because they rely on either synthetic data with significant sim-to-real gaps or teleoperated demonstrations that lack scale and diversity. To address this data bottleneck, we propose treating human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. To model precise hand trajectories for action learning, we also introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy. To support this paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources, including motion capture, VR, and RGB-only videos, into a large-scale dataset with millions of motion-based instructional instances. Empirically, Being-H0 performs strongly in hand motion generation and instruction following, and it scales well with model and data size. Importantly, we observe the expected gains in real-world robotic manipulation when physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.
