MilliVid: Hierarchical Latent Variables for Long-Range Consistency in Video Generation

Abstract

Video generative models have become increasingly powerful, but long-range consistency remains hard to achieve because even a few dozen frames require impractically long transformer sequences. We show that this issue can be mitigated by generating video with coarse-to-fine rollout in a multi-scale token space. Our approach is simple. First, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution down to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of perceptually less relevant details. We validate this approach on a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
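
To make the sequence-length argument concrete, the following is a minimal sketch of the token arithmetic. It is not the authors' implementation. The per-level token counts, the number of context frames, the assumption that a frame at level k carries the tokens of all coarser levels as well, and the names `TokenHierarchy` and `sequence_length` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenHierarchy:
    """Hypothetical per-frame token counts, ordered coarsest to finest."""

    tokens_per_level: tuple[int, ...]

    def tokens_up_to(self, level: int) -> int:
        """Tokens for one frame represented at levels 0..level inclusive.

        Assumes finer levels add to coarser ones, so the cost is cumulative.
        """
        return sum(self.tokens_per_level[: level + 1])


def sequence_length(
    hierarchy: TokenHierarchy, context_levels: list[int], target_level: int
) -> int:
    """Total tokens for one rollout step.

    Each context frame is kept at its own level of detail, and one target
    frame is generated up to target_level.
    """
    context = sum(hierarchy.tokens_up_to(lv) for lv in context_levels)
    return context + hierarchy.tokens_up_to(target_level)


# Illustrative numbers only: 4 levels, from 4 tokens up to 256 tokens per frame.
hierarchy = TokenHierarchy(tokens_per_level=(4, 16, 64, 256))
finest = len(hierarchy.tokens_per_level) - 1

# Baseline: all 64 context frames kept at full detail.
full = sequence_length(hierarchy, [finest] * 64, finest)

# Mixed: 56 distant frames at the coarsest level, 8 recent frames at full detail.
mixed = sequence_length(hierarchy, [0] * 56 + [finest] * 8, finest)

print(full, mixed)  # 22100 3284
```

With these made-up numbers, keeping distant frames at the coarsest level shrinks the sequence by roughly a factor of 6.7 while every context frame still contributes its layout-level tokens. The actual level sizes and context schedule used in the paper are not stated in the abstract.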