LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Abstract

We present LARP, a novel video tokenizer designed to overcome limitations of current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers, which encode local visual patches directly into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design lets LARP capture more global and semantic representations rather than being limited to local patch-level information. It also supports an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token in its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens and progressively pushes them toward an optimal configuration during training, which leads to smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance: it achieves state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).
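The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract only states that learned holistic queries gather information from the visual content, that the result is a sequence of discrete tokens, and that an AR prior predicts the next token during training. Everything else here is an assumption made for illustration: single-head cross-attention, a nearest-neighbor codebook lookup as the quantizer, a bigram logit table standing in for the lightweight AR transformer, and the hypothetical names holistic_tokenize and prior_next_token_loss.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def holistic_tokenize(patches, queries, codebook):
    """Toy holistic tokenizer (assumed design, not LARP's actual architecture).

    patches:  (N, d) patch embeddings of a video
    queries:  (K, d) learned holistic queries; K sets the token count
    codebook: (V, d) code vectors
    returns:  (K,) integer token ids
    """
    d = patches.shape[1]
    # Each query attends over ALL patches, so every token can see global content.
    attn = softmax(queries @ patches.T / np.sqrt(d), axis=-1)   # (K, N)
    latents = attn @ patches                                    # (K, d)
    # Assumed quantizer: pick the nearest code by squared Euclidean distance.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (K, V)
    return dists.argmin(axis=1)

def prior_next_token_loss(tokens, bigram_logits):
    """Mean cross-entropy of predicting tokens[t+1] from tokens[t].

    bigram_logits (V, V) is a stand-in for the AR transformer prior.
    """
    logits = bigram_logits[tokens[:-1]]                         # (K-1, V)
    log_probs = np.log(softmax(logits, axis=-1))
    targets = tokens[1:]
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
N, d, V = 64, 16, 32
patches = rng.normal(size=(N, d))
codebook = rng.normal(size=(V, d))

# The number of tokens follows the number of queries, not the patch grid.
for K in (8, 24):
    queries = rng.normal(size=(K, d))
    tokens = holistic_tokenize(patches, queries, codebook)
    # An untrained (all-zero) prior is uniform, so the loss equals log(V).
    loss = prior_next_token_loss(tokens, np.zeros((V, V)))
    print(K, tokens.shape, loss)
```

In this sketch the token count is controlled entirely by how many queries are supplied, which mirrors the abstract's claim of supporting an arbitrary number of discrete tokens. In a trained system the prior loss would presumably be minimized together with the reconstruction objective, so that the token order becomes easy to predict autoregressively; the sketch only evaluates the loss and does not train anything.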
