DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Abstract

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
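
To make the idea of test-time action sequence optimization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy additive dynamics function (`toy_dynamics`, hypothetical) in place of a learned predictor over DINOv2 patch features, low-dimensional vectors in place of patch features, and the cross-entropy method as the optimizer, which the abstract does not specify. All function names and hyperparameters are illustrative.

```python
import numpy as np

def toy_dynamics(z, a):
    """Hypothetical stand-in for a learned latent dynamics model: z' = z + a."""
    return z + a

def rollout(z0, actions, dynamics):
    """Roll the dynamics forward over a batch of action sequences.

    z0: (d,) initial feature; actions: (n, horizon, d).
    Returns the final predicted features, shape (n, d).
    """
    z = np.broadcast_to(z0, (actions.shape[0], z0.shape[0])).copy()
    for t in range(actions.shape[1]):
        z = dynamics(z, actions[:, t])
    return z

def cem_plan(z0, z_goal, dynamics, horizon=5, n_samples=256,
             n_elite=32, n_iters=10, seed=0):
    """Search for an action sequence whose predicted final feature is
    close to the goal feature, using the cross-entropy method."""
    rng = np.random.default_rng(seed)
    dim = z0.shape[0]  # toy assumption: action dim equals feature dim
    mean = np.zeros((horizon, dim))
    std = np.ones((horizon, dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((n_samples, horizon, dim))
        actions = mean + std * noise
        # Cost: squared distance between predicted and goal features.
        z_final = rollout(z0, actions, dynamics)
        cost = np.sum((z_final - z_goal) ** 2, axis=1)
        # Refit the sampling distribution to the lowest-cost candidates.
        elite = actions[np.argsort(cost)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean

z0 = np.zeros(2)
z_goal = np.array([1.0, -2.0])
plan = cem_plan(z0, z_goal, toy_dynamics)
z_end = rollout(z0, plan[None], toy_dynamics)[0]
print("distance to goal:", np.linalg.norm(z_end - z_goal))
```

In this sketch no reward model, demonstration, or inverse model appears: the goal enters only as a target feature vector, which mirrors the role the abstract assigns to goal patch features. In a real setting, `z0` and `z_goal` would presumably come from a frozen visual encoder applied to the current and goal observations.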
