Unified Video Action Model
February 28, 2025
Authors: Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song
cs.AI
Abstract
A unified video and action model holds significant promise for robotics,
where videos provide rich scene information for action prediction, and actions
provide dynamics information for video prediction. However, effectively
combining video generation and action prediction remains challenging, and
current video generation-based methods struggle to match the performance of
direct policy learning in action accuracy and inference speed. To bridge this
gap, we introduce the Unified Video Action model (UVA), which jointly optimizes
video and action predictions to achieve both high accuracy and efficient action
inference. The key lies in learning a joint video-action latent representation
and decoupling video-action decoding. The joint latent representation bridges
the visual and action domains, effectively modeling the relationship between
video and action sequences. Meanwhile, the decoupled decoding, powered by two
lightweight diffusion heads, enables high-speed action inference by bypassing
video generation during inference. Such a unified framework further enables
versatile functionality through masked input training. By selectively masking
actions or videos, a single model can tackle diverse tasks beyond policy
learning, such as forward and inverse dynamics modeling and video generation.
Via an extensive set of experiments, we demonstrate that UVA can serve as a
general-purpose solution for a wide range of robotics tasks, such as policy
learning, forward/inverse dynamics and video observation prediction, without
compromising performance compared to methods tailored for specific
applications. Results are best viewed on
https://unified-video-action-model.github.io/.
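The masked-input training described above can be illustrated with a minimal, framework-free sketch. This is not the paper's implementation: the function name, the task labels, and the use of `None` as a stand-in for a learned mask token are all illustrative assumptions. The idea it demonstrates is that hiding one modality at a time lets a single model be trained for policy learning, forward/inverse dynamics, and video generation.

```python
# Hypothetical sketch of selective modality masking (not from the paper's code).
MASK = None  # placeholder for a learned mask token

def build_inputs(video_frames, action_seq, task):
    """Hide one modality depending on the task; the model would then be
    trained to reconstruct whatever was masked, so one network covers
    several objectives."""
    masked_video = [MASK] * len(video_frames)
    masked_actions = [MASK] * len(action_seq)
    if task == "policy":            # observe video, predict actions
        return video_frames, masked_actions
    if task == "forward_dynamics":  # observe video + actions, predict frames
        return video_frames, action_seq
    if task == "inverse_dynamics":  # observe frames, recover the actions
        return video_frames, masked_actions
    if task == "video_generation":  # observe actions, predict video
        return masked_video, action_seq
    raise ValueError(f"unknown task: {task}")
```

For example, `build_inputs(frames, actions, "policy")` keeps the frames and masks the actions, matching the policy-learning case where video supplies scene context and actions are the prediction target.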