PhysBrain 1.0 Technical Report

Abstract

Vision-language-action (VLA) models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training the PhysBrain vision-language models (VLMs). The resulting physical priors are then transferred to VLA policies through a capability-preserving, language-sensitive adaptation design. Across multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves state-of-the-art results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
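
To make the data-engine step more concrete, the following is a minimal sketch of how structured annotations of one egocentric clip might be turned into question-answer supervision. It is an illustration under stated assumptions, not the report's implementation: the `ClipAnnotation` class, its field names, the `to_qa_pairs` function, and the question templates are all hypothetical, chosen only to mirror the four annotation types named in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip annotation; all field names are illustrative."""
    scene_elements: list[str]
    spatial_dynamics: str
    action_execution: str
    # Each entry is (object_a, relation, object_b),
    # e.g. ("cup", "closer than", "kettle").
    depth_relations: list[tuple[str, str, str]] = field(default_factory=list)


def to_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotation into templated question-answer records."""
    qa = [
        {
            "question": "Which objects are visible in the scene?",
            "answer": ", ".join(ann.scene_elements),
        },
        {
            "question": "How do the objects move relative to each other?",
            "answer": ann.spatial_dynamics,
        },
        {
            "question": "What action is the person performing?",
            "answer": ann.action_execution,
        },
    ]
    # One depth-comparison question per annotated relation.
    for obj_a, relation, obj_b in ann.depth_relations:
        qa.append(
            {
                "question": (
                    f"How far is the {obj_a} from the camera "
                    f"compared with the {obj_b}?"
                ),
                "answer": f"The {obj_a} is {relation} the {obj_b}.",
            }
        )
    return qa


# Made-up example annotation, for illustration only.
example = ClipAnnotation(
    scene_elements=["cup", "kettle"],
    spatial_dynamics="The kettle tilts toward the cup.",
    action_execution="Pouring water from the kettle into the cup.",
    depth_relations=[("cup", "closer than", "kettle")],
)
qa_pairs = to_qa_pairs(example)
```

In this sketch each annotation type maps to a fixed question template, so one clip yields three base records plus one per depth relation. A real pipeline would likely vary the phrasing and filter low-quality answers; those steps are omitted here.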