ActiveMimic: Egocentric Video Pretraining through Active Perception

Abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal: active perception. In egocentric video, humans continuously reposition their viewpoint during manipulation, and standard pipelines treat the resulting camera motion as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. In real-world experiments across tasks with diverse active perception demands, ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that the active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, which supports active perception as the key to unlocking egocentric human video for robot pretraining.
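To make the idea of treating camera motion as a viewpoint action concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame camera-to-world poses and world-frame wrist positions have already been recovered by some upstream estimator. The function names (`invert_pose`, `viewpoint_actions`, `wrist_in_camera`) and the 4x4 homogeneous pose representation are hypothetical choices made for illustration only.

```python
import numpy as np

def invert_pose(T):
    """Invert a 4x4 rigid transform using its rotation/translation structure."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def viewpoint_actions(cam_poses):
    """Relative camera motion between consecutive frames (assumed action form).

    cam_poses: (N, 4, 4) camera-to-world transforms.
    Returns (N-1, 4, 4): the pose of camera t+1 expressed in camera t's frame.
    """
    return np.stack([invert_pose(cam_poses[i]) @ cam_poses[i + 1]
                     for i in range(len(cam_poses) - 1)])

def wrist_in_camera(cam_poses, wrist_world):
    """Express world-frame wrist positions (N, 3) in each frame's camera coordinates."""
    ones = np.ones((len(wrist_world), 1))
    homog = np.concatenate([wrist_world, ones], axis=1)
    return np.stack([(invert_pose(T) @ p)[:3] for T, p in zip(cam_poses, homog)])

# Toy example: the camera slides 1 m along x while the wrist stays fixed in the world.
poses = np.stack([np.eye(4), np.eye(4)])
poses[1, :3, 3] = [1.0, 0.0, 0.0]
wrist = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

view_act = viewpoint_actions(poses)        # one relative transform per frame pair
wrist_cam = wrist_in_camera(poses, wrist)  # wrist as seen from each camera frame
```

In this toy case the camera moves while the wrist is stationary, so the wrist appears to move in camera coordinates. Keeping the camera motion as an explicit action, rather than discarding it, is what lets the two signals be separated.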