PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
December 18, 2025
Authors: Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen
cs.AI
Abstract
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most vision-language models (VLMs) are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos, which naturally capture rich interaction context and causal structure, offer a scalable alternative. The key challenge is to convert raw egocentric videos into structured and reliable embodied training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. Training on E2E-3M yields PhysBrain, an egocentric-aware embodied brain that exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. PhysBrain also provides an egocentric-aware initialization that enables more sample-efficient vision-language-action (VLA) fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
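To make the idea of "schema-driven VQA supervision with enforced evidence grounding and temporal consistency" concrete, the sketch below shows what one such supervision record might look like. It is a minimal, hypothetical illustration: the class names, field names, level labels, and validation checks are assumptions for exposition, not the actual E2E-3M schema from the paper.

```python
# Minimal sketch (hypothetical): one schema-driven VQA supervision record in the
# spirit of the Egocentric2Embodiment pipeline. Field names, level labels, and
# checks are illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceSpan:
    """Frame interval in the source egocentric clip that grounds an answer."""
    start_frame: int
    end_frame: int


@dataclass
class VQASample:
    """A single question-answer pair tied to explicit visual evidence."""
    level: str                      # e.g. "state_change", "contact", "planning"
    question: str
    answer: str
    evidence: List[EvidenceSpan] = field(default_factory=list)


def is_valid(sample: VQASample, num_frames: int) -> bool:
    """Enforce evidence grounding and temporal consistency for one sample."""
    if not sample.evidence:
        return False                        # every answer must cite evidence
    for span in sample.evidence:
        if not (0 <= span.start_frame <= span.end_frame < num_frames):
            return False                    # evidence must lie inside the clip
    starts = [s.start_frame for s in sample.evidence]
    return starts == sorted(starts)         # evidence spans in temporal order


# Usage: keep only samples whose evidence is well-formed for a 300-frame clip.
samples = [
    VQASample("planning",
              "What should be done after the cup is grasped?",
              "Pour the water into the kettle.",
              [EvidenceSpan(120, 180), EvidenceSpan(181, 240)]),
]
filtered = [s for s in samples if is_valid(s, num_frames=300)]
```

Structuring each sample this way is one plausible reading of the abstract: multi-level question types, answers that must cite frame-level evidence, and a consistency check that rejects records whose cited evidence is out of range or temporally out of order.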