Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

December 5, 2024
Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
cs.AI

Abstract

In recent years, there has been a surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and whose representations can in turn be decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively denoise video clips when conditioned on the features of a video tokenizer, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna, which supports video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.
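
To make the Gaussian Mixture Model step concrete, the following is a minimal sketch of how a continuous-valued feature vector can be scored and sampled under a mixture. It is not the authors' implementation: the abstract does not give the parameterization, so the sketch assumes a diagonal-covariance mixture whose weights, means, and variances would be predicted elsewhere (for example, by the LLM). The function names `gmm_log_likelihood`, `gmm_nll`, and `gmm_sample` are hypothetical.

```python
import math
import random


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of feature vector x under a diagonal-covariance mixture.

    Assumption: weights are strictly positive and sum to 1; means and
    variances hold one list per mixture component, each as long as x.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log N(x; mu, diag(var)), summed over feature dimensions
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # log-sum-exp over components for numerical stability
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood, usable as a training loss for one target x."""
    return -gmm_log_likelihood(x, weights, means, variances)


def gmm_sample(weights, means, variances, rng=random):
    """Draw one feature vector: pick a component, then sample each dimension."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k])]


# Toy usage with two components over 3-dimensional features (made-up values).
weights = [0.3, 0.7]
means = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]]
variances = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
target = [0.9, -0.8, 0.4]
loss = gmm_nll(target, weights, means, variances)
sample = gmm_sample(weights, means, variances, random.Random(0))
```

Under these assumptions, training would minimize `gmm_nll` against target tokenizer features, and generation would draw features with `gmm_sample` and hand them to the diffusion de-tokenizer as the condition.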

