ActiveMimic: Active-Perception-Based Pretraining on Egocentric Video

Abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal: active perception. During manipulation, humans continuously reposition their viewpoint, and standard pipelines treat the resulting camera motion as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. In real-world experiments across tasks with diverse active perception demands, ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that the active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, supporting active perception as the key to unlocking egocentric human video for robot pretraining.
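
To make the idea of treating camera motion as a viewpoint action concrete, the following is a minimal sketch, not the ActiveMimic implementation. It assumes that camera and wrist poses are already available as synchronized 4x4 world-frame transforms, and it assumes a 6-D parameterization (translation plus axis-angle rotation) of frame-to-frame pose changes. The function names and the 12-D joint action layout are hypothetical choices made only for illustration.

```python
import numpy as np

def relative_pose(T_prev, T_next):
    """Pose of the next frame expressed in the previous frame's coordinates."""
    return np.linalg.inv(T_prev) @ T_next

def rotation_to_axis_angle(R):
    """Convert a 3x3 rotation matrix to an axis-angle vector.

    Assumes the rotation angle is well below 180 degrees, which is
    reasonable for small frame-to-frame motion.
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([
        R[2, 1] - R[1, 2],
        R[0, 2] - R[2, 0],
        R[1, 0] - R[0, 1],
    ]) / (2.0 * np.sin(theta))
    return axis * theta

def pose_delta_to_action(T_delta):
    """Flatten a relative transform into [translation (3), axis-angle (3)]."""
    return np.concatenate([T_delta[:3, 3], rotation_to_axis_angle(T_delta[:3, :3])])

def joint_actions(camera_poses, wrist_poses):
    """Build per-step joint actions from synchronized trajectories.

    camera_poses, wrist_poses: arrays of shape (T, 4, 4).
    Returns an array of shape (T - 1, 12): the first 6 entries are the
    (assumed) viewpoint action, the last 6 the (assumed) wrist action.
    """
    actions = []
    for t in range(len(camera_poses) - 1):
        view = pose_delta_to_action(relative_pose(camera_poses[t], camera_poses[t + 1]))
        wrist = pose_delta_to_action(relative_pose(wrist_poses[t], wrist_poses[t + 1]))
        actions.append(np.concatenate([view, wrist]))
    return np.stack(actions)
```

Under these assumptions, camera motion is kept as a supervised output alongside the wrist motion instead of being stabilized away, which is the distinction the abstract draws against standard pipelines. The actual action representation used by ActiveMimic is not specified in the abstract.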