PhysBrain 1.0 Technical Report

Abstract

Vision-language-action (VLA) models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain vision-language models (VLMs). The resulting physical priors are then transferred to VLA policies through a capability-preserving, language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves state-of-the-art results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.