PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
December 18, 2025
Authors: Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen
cs.AI
Abstract
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most vision-language models (VLMs) are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable supervision for embodied training. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. Training on E2E-3M yields PhysBrain, an egocentric-aware embodied brain that exhibits substantially improved egocentric understanding on EgoThink, particularly for planning. PhysBrain also provides an egocentric-aware initialization that enables more sample-efficient vision-language-action (VLA) fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
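The abstract does not spell out the data format, but the schema-driven VQA supervision it describes can be illustrated with a minimal sketch. The record layout, field names, and checks below are hypothetical, not the authors' actual schema; they only show how evidence grounding (frame citations within the clip) and temporal consistency (time-ordered evidence) might be enforced when filtering generated QA pairs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvidenceSpan:
    """A span of egocentric video frames cited as evidence for an answer."""
    start_frame: int
    end_frame: int


@dataclass
class VQASample:
    """One schema-driven QA pair distilled from an egocentric clip (hypothetical layout)."""
    clip_id: str
    level: str                 # e.g. "state_change", "contact", "long_horizon_plan"
    question: str
    answer: str
    evidence: List[EvidenceSpan]


def is_valid(sample: VQASample, clip_num_frames: int) -> bool:
    """Keep a sample only if its evidence is grounded in the clip and temporally consistent."""
    if not sample.evidence:
        return False  # evidence grounding: every answer must cite at least one frame span
    last_end = -1
    for span in sample.evidence:
        in_clip = 0 <= span.start_frame <= span.end_frame < clip_num_frames
        ordered = span.start_frame >= last_end  # temporal consistency: spans must not run backwards
        if not (in_clip and ordered):
            return False
        last_end = span.end_frame
    return True
```

Under this kind of filter, QA pairs whose cited frames fall outside the clip or whose evidence is out of temporal order would be discarded before training.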
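The transfer claim (PhysBrain as an egocentric-aware initialization for VLA fine-tuning) can likewise be sketched in generic PyTorch. The checkpoint path, module shapes, and hyperparameters below are placeholders rather than the authors' training code; the sketch only shows the idea of reusing a pretrained vision-language backbone and fine-tuning it together with a fresh action head.

```python
import os
import torch
from torch import nn


class VLAPolicy(nn.Module):
    """Hypothetical VLA policy: a pretrained vision-language backbone plus a fresh action head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone                              # initialized from PhysBrain weights
        self.action_head = nn.Linear(hidden_dim, action_dim)  # trained from scratch for control

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        features = self.backbone(obs_tokens)                  # egocentric-aware representations
        return self.action_head(features)                     # predicted action vector


# Stand-in backbone; in practice this would be the PhysBrain VLM loaded from its checkpoint.
backbone = nn.Linear(1024, 1024)
ckpt_path = "physbrain_backbone.pt"                           # hypothetical checkpoint path
if os.path.exists(ckpt_path):
    backbone.load_state_dict(torch.load(ckpt_path))

policy = VLAPolicy(backbone, hidden_dim=1024, action_dim=7)   # 7-DoF action space as an example
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)   # fine-tune end to end on robot data
```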