MilliVid: Hierarchical Latent Variables for Long-Range Consistency in Video Generation

Abstract

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution down to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
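To make the sequence-length argument concrete, the following is a minimal sketch of how a context budget could shrink when distant frames are represented only at coarse levels. It is not the schedule used in this work: the per-level token counts in `TOKENS_PER_LEVEL`, the `frames_per_level` parameter, and the rule that maps temporal distance to a level are all hypothetical choices made for illustration.

```python
# Hypothetical token pyramid: tokens per frame at each level, finest first.
# These numbers are illustrative assumptions, not values from the paper.
TOKENS_PER_LEVEL = [256, 64, 16, 4]


def level_for_distance(distance: int, frames_per_level: int = 8) -> int:
    """Map temporal distance (in frames) to a pyramid level.

    Assumed rule: every `frames_per_level` frames further into the past,
    the context frame is represented one level coarser, until the
    coarsest level is reached.
    """
    return min(distance // frames_per_level, len(TOKENS_PER_LEVEL) - 1)


def multiscale_context_tokens(num_context_frames: int,
                              frames_per_level: int = 8) -> int:
    """Total context tokens when distant frames use coarser levels."""
    return sum(
        TOKENS_PER_LEVEL[level_for_distance(d, frames_per_level)]
        for d in range(1, num_context_frames + 1)
    )


def full_resolution_context_tokens(num_context_frames: int) -> int:
    """Total context tokens when every frame uses the finest level."""
    return num_context_frames * TOKENS_PER_LEVEL[0]


if __name__ == "__main__":
    n = 64
    print("full resolution:", full_resolution_context_tokens(n))
    print("multi-scale:    ", multiscale_context_tokens(n))
```

Under these assumed numbers, a 64-frame context costs 16384 tokens at full resolution but 2596 tokens with the distance-based schedule, which illustrates how coarse levels could keep many more frames in context for a fixed sequence length.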