Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
March 12, 2025
Authors: Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang
cs.AI
Abstract
A variety of Auto-Regressive Video Diffusion Models (ARVDM) have achieved
remarkable successes in generating realistic long-form videos. However,
theoretical analyses of these models remain scant. In this work, we develop
theoretical underpinnings for these models and use our insights to improve the
performance of existing models. We first develop Meta-ARVDM, a unified
framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we
analyze the KL-divergence between the videos generated by Meta-ARVDM and the
true videos. Our analysis uncovers two important phenomena inherent to ARVDM --
error accumulation and memory bottleneck. By deriving an information-theoretic
impossibility result, we show that the memory bottleneck phenomenon cannot be
avoided. To mitigate the memory bottleneck, we design various network
structures to explicitly use more past frames. We also achieve a significantly
improved trade-off between the mitigation of the memory bottleneck and the
inference efficiency by compressing the frames. Experimental results on DMLab
and Minecraft validate the efficacy of our methods. Our experiments also
demonstrate a Pareto frontier between error accumulation and the memory
bottleneck across different methods.
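The error-accumulation phenomenon the abstract names can be illustrated with a toy, hypothetical simulation (a sketch, not the paper's model or analysis): if each generated frame is conditioned on the previous one, small per-step model errors compound, so the drift from the ground-truth trajectory grows with rollout length. The function name `rollout_drift` and the scalar-frame setup are assumptions made for illustration only.

```python
import random

def rollout_drift(steps, noise_std, trials=500):
    """Toy stand-in for autoregressive generation: each 'frame' is the
    previous frame plus an i.i.d. per-step model error. Returns the
    average absolute drift from the ground-truth trajectory (held at 0)
    after `steps` generated frames, averaged over `trials` runs."""
    total = 0.0
    for t in range(trials):
        rng = random.Random(t)  # deterministic per-trial seed
        frame = 0.0             # ground truth stays at 0
        for _ in range(steps):
            frame += rng.gauss(0.0, noise_std)  # per-step prediction error
        total += abs(frame)
    return total / trials

# Longer rollouts accumulate more error: drift grows roughly like
# noise_std * sqrt(steps) in this toy random-walk model.
```

Under these assumptions the expected drift scales like `noise_std * sqrt(steps)`, mirroring how small per-frame errors accumulate over long generated videos.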