Learning Visual Feature-Based World Models via Residual Latent Action

Abstract

World models predict future transitions from observations and actions. Existing work focuses predominantly on image generation. Visual feature-based world models, by contrast, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose the *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first is a minimalist world action model with RLA that learns from actionless demonstration videos. The second is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and requiring no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
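
As a rough illustration of two ingredients named in the abstract, feature residuals and flow matching, the following NumPy sketch may help. It is not the authors' implementation: random arrays stand in for DINO features, the raw residual is used directly as the regression target (the abstract does not describe how RLA is encoded from residuals), and the function names `feature_residual`, `flow_matching_pair`, and `euler_sample` are hypothetical. The linear noise-to-data path is one common flow matching formulation, assumed here for simplicity.

```python
import numpy as np


def feature_residual(feat_t, feat_next):
    """Difference between the visual features of two consecutive frames."""
    return feat_next - feat_t


def flow_matching_pair(target, rng):
    """Build one flow matching training pair on a linear noise-to-data path.

    Returns the interpolated point x_t, the time t, the velocity a model
    would be trained to regress, and the sampled noise.
    """
    noise = rng.standard_normal(target.shape)
    t = rng.uniform(size=(target.shape[0], 1))  # one time value per sample
    x_t = (1.0 - t) * noise + t * target        # point on the straight path
    velocity = target - noise                   # constant velocity of that path
    return x_t, t, velocity, noise


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate a velocity field from t=0 (noise) to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumption: (batch, dim) arrays stand in for real frame features.
    feat_t = rng.standard_normal((4, 8))
    feat_next = rng.standard_normal((4, 8))
    residual = feature_residual(feat_t, feat_next)
    x_t, t, velocity, noise = flow_matching_pair(residual, rng)
    # A trained network would replace this oracle velocity field.
    sample = euler_sample(lambda x, tt: residual - noise, noise, steps=5)
    print(np.allclose(sample, residual))
```

In a real system the oracle velocity field would be replaced by a learned network conditioned on the current observation, and the sampled quantity would be decoded back into future features.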
