Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression
February 6, 2025
Authors: Lirui Wang, Kevin Zhao, Chaoqi Liu, Xinlei Chen
cs.AI
Abstract
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluations for scaling robot learning. Building interactive video world models and policies for robotics is difficult because of the challenge of handling diverse settings while maintaining the computational efficiency needed to run in real time. HMA uses heterogeneous pre-training on observation and action sequences from different robotic embodiments, domains, and tasks, and uses masked autoregression to generate quantized or soft tokens for video prediction. HMA achieves better visual fidelity and controllability than previous robotic video generation models and runs 15 times faster in the real world. After post-training, the model can serve as a video simulator driven by low-level action inputs, for evaluating policies and generating synthetic data. See https://liruiw.github.io/hma for more information.
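To make the masked-autoregression idea concrete, the sketch below shows one generic way such a model could predict quantized video tokens conditioned on low-level actions, filling in masked positions over a few confidence-ranked steps. This is a minimal illustration of the general technique only: all class names, dimensions, and the unmasking schedule are assumptions, not the authors' HMA implementation.

```python
# Minimal sketch of action-conditioned masked autoregression over quantized
# video tokens. Names, sizes, and the unmasking schedule are hypothetical.
import torch
import torch.nn as nn

class MaskedAutoregressor(nn.Module):
    """Predict masked video tokens given visible tokens and action inputs."""

    def __init__(self, vocab_size=1024, dim=256, action_dim=7,
                 num_layers=4, max_len=256):
        super().__init__()
        self.mask_id = vocab_size                      # extra id for [MASK]
        self.token_embed = nn.Embedding(vocab_size + 1, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        self.action_proj = nn.Linear(action_dim, dim)  # per-step action stem
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, vocab_size)         # logits over codebook

    def forward(self, tokens, actions):
        # tokens: (B, T) quantized video tokens, some set to self.mask_id
        # actions: (B, T, action_dim) low-level actions aligned with tokens
        x = (self.token_embed(tokens)
             + self.pos_embed[:, :tokens.size(1)]
             + self.action_proj(actions))
        return self.head(self.backbone(x))             # (B, T, vocab_size)

@torch.no_grad()
def generate(model, tokens, actions, steps=8):
    """Fill masked positions over a few steps, most confident first."""
    for _ in range(steps):
        masked = tokens == model.mask_id
        if not masked.any():
            break
        probs = model(tokens, actions).softmax(-1)
        conf, pred = probs.max(-1)
        conf = conf.masked_fill(~masked, -1.0)         # rank masked slots only
        k = max(1, int(masked.sum()) // steps)
        idx = conf.flatten().topk(k).indices
        tokens.flatten()[idx] = pred.flatten()[idx]    # commit predictions
    return tokens
```

Parallel, iterative unmasking like this is what distinguishes masked autoregression from strictly sequential next-token decoding; the paper's reported real-time speedup is consistent with decoding many tokens per step rather than one at a time.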
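The abstract also describes using the post-trained model as a video simulator driven by low-level actions. A hypothetical single-step rollout under the same sketch could keep the current frame's tokens visible, mask the next frame, and let the model fill it in; again, every name and shape here is an illustrative assumption.

```python
def simulate_step(model, frame_tokens, action):
    # frame_tokens: (1, T) tokens of the current frame (kept visible)
    # action: (1, action_dim) low-level action to apply
    # Assumes model.pos_embed covers the concatenated length 2 * T.
    T = frame_tokens.size(1)
    seq = torch.cat([frame_tokens,
                     torch.full_like(frame_tokens, model.mask_id)], dim=1)
    acts = action[:, None, :].expand(-1, 2 * T, -1)  # broadcast over tokens
    seq = generate(model, seq, acts)
    return seq[:, T:]                                # predicted next frame
```

Looping such a step with a policy's action outputs, and decoding tokens back to pixels, yields the kind of learned simulator the abstract proposes for evaluating policies and generating synthetic data without a real robot.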