ActiveMimic: Egocentric Video Pretraining Based on Active Perception

Abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal: active perception. Humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. In real-world experiments across tasks with diverse active perception demands, ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that the active perception capability originates from egocentric human video pretraining rather than from robot-specific fine-tuning, which supports active perception as the key to unlocking egocentric human video for robot pretraining.