PhysBrain 1.0 Technical Report

May 14, 2026
Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan, Xiaolin Hu, Zhaolong Shen, Yuzhuo Miao, Haishan Liu, Yuxuan Tian, Yukun Shi, Cong Huang, Kai Chen
cs.AI

Abstract

Vision-language-action (VLA) models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain vision-language models (VLMs). The resulting physical priors are then transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves state-of-the-art results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.