Video Occupancy Models

Abstract

We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy Models (VOCs). VOCs operate in a compact latent space, which avoids the need to predict individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, which avoids the need for multistep rollouts. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at https://github.com/manantomar/video-occupancy-models.
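To make the phrase "discounted distribution of future states" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common convention that a discounted future state is the state k steps ahead, where k is drawn from a geometric distribution with parameter 1 - gamma. The names `sample_future_offset`, `sample_occupancy_target`, `latents`, and `gamma` are hypothetical, and the list of latents stands in for whatever encoder output a real model would use.

```python
import random


def sample_future_offset(gamma, rng):
    """Draw k >= 1 with P(k) = (1 - gamma) * gamma ** (k - 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    k = 1
    # Keep stepping forward with probability gamma, stop with 1 - gamma.
    while rng.random() < gamma:
        k += 1
    return k


def sample_occupancy_target(latents, t, gamma, rng):
    """Pick one latent from the discounted future of step t.

    Assumption: offsets that run past the end of the clip are
    truncated to the final frame.
    """
    k = sample_future_offset(gamma, rng)
    return latents[min(t + k, len(latents) - 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    latents = list(range(100))  # placeholder latents, one per frame
    targets = [sample_occupancy_target(latents, 10, 0.9, rng) for _ in range(5)]
    print(targets)
```

A model trained to match such targets predicts where the video will be over a discounted horizon in one forward pass, whereas a one-step model has to be unrolled k times to reach the same frame.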
