LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
October 28, 2024
Authors: Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava
cs.AI
Abstract
We present LARP, a novel video tokenizer designed to overcome limitations in
current video tokenization methods for autoregressive (AR) generative models.
Unlike traditional patchwise tokenizers that directly encode local visual
patches into discrete tokens, LARP introduces a holistic tokenization scheme
that gathers information from the visual content using a set of learned
holistic queries. This design allows LARP to capture more global and semantic
representations, rather than being limited to local patch-level information.
Furthermore, it offers flexibility by supporting an arbitrary number of
discrete tokens, enabling adaptive and efficient tokenization based on the
specific requirements of the task. To align the discrete token space with
downstream AR generation tasks, LARP integrates a lightweight AR transformer as
a training-time prior model that predicts the next token on its discrete latent
space. By incorporating the prior model during training, LARP learns a latent
space that is not only optimized for video reconstruction but is also
structured in a way that is more conducive to autoregressive generation.
Moreover, this process defines a sequential order for the discrete tokens,
progressively pushing them toward an optimal configuration during training,
ensuring smoother and more accurate AR generation at inference time.
Comprehensive experiments demonstrate LARP's strong performance, achieving
state-of-the-art FVD on the UCF101 class-conditional video generation
benchmark. LARP enhances the compatibility of AR models with videos and opens
up the potential to build unified high-fidelity multimodal large language
models (MLLMs).
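
The holistic tokenization scheme can be pictured as a set of learned query vectors that attend jointly with the patch embeddings, after which only the query outputs are vector-quantized. Below is a minimal PyTorch sketch of that idea, assuming a ViT-style patch encoder upstream and a standard nearest-neighbor codebook; the module name `HolisticTokenizer` and all hyperparameters are illustrative, not the paper's actual implementation. Note how the number of discrete tokens is set by `num_queries` rather than by the patch grid, which is what permits an arbitrary token budget.

```python
import torch
import torch.nn as nn

class HolisticTokenizer(nn.Module):
    """Query-based tokenization sketch: learned holistic queries attend
    jointly with patch embeddings; only the query outputs are quantized."""

    def __init__(self, num_queries=256, dim=512, codebook_size=1024, depth=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_emb):  # patch_emb: (B, N, dim) from a patch encoder
        B = patch_emb.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries and patches are processed together; keeping only the query
        # outputs decouples the token count from the spatiotemporal patch grid.
        x = self.encoder(torch.cat([q, patch_emb], dim=1))
        z = x[:, : self.queries.size(0)]          # (B, num_queries, dim)
        # Nearest-neighbor vector quantization against the codebook.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        ids = dists.argmin(dim=-1)                # (B, num_queries) token ids
        z_q = self.codebook(ids)
        # Straight-through estimator so gradients reach the encoder outputs.
        z_q = z + (z_q - z).detach()
        return ids, z_q
```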
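The training-time prior can likewise be sketched as a small causal transformer over the discrete token ids whose next-token cross-entropy is added to the reconstruction objective. In this sketch, `ARPrior` and `prior_loss` are hypothetical names; how the prior's gradient is propagated back into the tokenizer's encoder (e.g., through a relaxation of the hard token assignment) is a design detail of the paper that the sketch does not capture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARPrior(nn.Module):
    """Lightweight causal transformer that predicts the next discrete token."""

    def __init__(self, codebook_size=1024, dim=512, depth=2, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, ids):  # ids: (B, T) discrete token ids from the tokenizer
        T = ids.size(1)
        x = self.tok_emb(ids) + self.pos_emb[:, :T]
        # Causal mask so position t only attends to tokens 0..t.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (B, T, codebook_size) next-token logits

def prior_loss(prior, ids):
    """Next-token cross-entropy: positions 0..T-2 predict tokens 1..T-1."""
    logits = prior(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
```

A joint training step would then minimize something like `recon_loss + lambda_prior * prior_loss(prior, ids)`, so that the discrete latent space is shaped simultaneously for reconstruction and for downstream AR generation, consistent with the abstract's description of the prior's role.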