PhysBrain 1.0 Technical Report

Abstract

Vision-language-action (VLA) models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are then transferred to VLA policies through a capability-preserving, language-sensitive adaptation design. Across multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves state-of-the-art (SOTA) results, with especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.