PhysBrain 1.0 Technical Report

Abstract

Vision-language-action (VLA) models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain vision-language models (VLMs). The resulting physical priors are then transferred to VLA policies through a capability-preserving, language-sensitive adaptation design. Across multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves state-of-the-art results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.