LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
October 28, 2024
Authors: Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava
cs.AI
Abstract
We present LARP, a novel video tokenizer designed to overcome limitations in
current video tokenization methods for autoregressive (AR) generative models.
Unlike traditional patchwise tokenizers that directly encode local visual
patches into discrete tokens, LARP introduces a holistic tokenization scheme
that gathers information from the visual content using a set of learned
holistic queries. This design allows LARP to capture more global and semantic
representations, rather than being limited to local patch-level information.
Furthermore, it offers flexibility by supporting an arbitrary number of
discrete tokens, enabling adaptive and efficient tokenization based on the
specific requirements of the task. To align the discrete token space with
downstream AR generation tasks, LARP integrates a lightweight AR transformer as
a training-time prior model that predicts the next token on its discrete latent
space. By incorporating the prior model during training, LARP learns a latent
space that is not only optimized for video reconstruction but is also
structured in a way that is more conducive to autoregressive generation.
Moreover, this process defines a sequential order for the discrete tokens,
progressively pushing them toward an optimal configuration during training,
ensuring smoother and more accurate AR generation at inference time.
Comprehensive experiments demonstrate LARP's strong performance, achieving
state-of-the-art FVD on the UCF101 class-conditional video generation
benchmark. LARP enhances the compatibility of AR models with videos and opens
up the potential to build unified high-fidelity multimodal large language
models (MLLMs).
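
The holistic tokenization scheme can be pictured as a set of learned query vectors that attend jointly with the patch embeddings, after which only the query outputs are vector-quantized. Below is a minimal PyTorch sketch of that idea, assuming a ViT-style patch encoder upstream and a standard nearest-neighbor codebook; the module name `HolisticTokenizer` and all hyperparameters are illustrative, not the paper's actual implementation. Note how the number of discrete tokens is set by `num_queries` rather than by the patch grid, which is what permits an arbitrary token budget.

```python
import torch
import torch.nn as nn

class HolisticTokenizer(nn.Module):
    """Query-based tokenization sketch: learned holistic queries attend
    jointly with patch embeddings; only the query outputs are quantized."""

    def __init__(self, num_queries=256, dim=512, codebook_size=1024, depth=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_emb):  # patch_emb: (B, N, dim) from a patch encoder
        B = patch_emb.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries and patches are processed together; keeping only the query
        # outputs decouples the token count from the spatiotemporal patch grid.
        x = self.encoder(torch.cat([q, patch_emb], dim=1))
        z = x[:, : self.queries.size(0)]          # (B, num_queries, dim)
        # Nearest-neighbor vector quantization against the codebook.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        ids = dists.argmin(dim=-1)                # (B, num_queries) token ids
        z_q = self.codebook(ids)
        # Straight-through estimator so gradients reach the encoder outputs.
        z_q = z + (z_q - z).detach()
        return ids, z_q
```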
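The training-time prior can likewise be sketched as a small causal transformer over the discrete token ids whose next-token cross-entropy is added to the reconstruction objective. In this sketch, `ARPrior` and `prior_loss` are hypothetical names; how the prior's gradient is propagated back into the tokenizer's encoder (e.g., through a relaxation of the hard token assignment) is a design detail of the paper that the sketch does not capture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARPrior(nn.Module):
    """Lightweight causal transformer that predicts the next discrete token."""

    def __init__(self, codebook_size=1024, dim=512, depth=2, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, ids):  # ids: (B, T) discrete token ids from the tokenizer
        T = ids.size(1)
        x = self.tok_emb(ids) + self.pos_emb[:, :T]
        # Causal mask so position t only attends to tokens 0..t.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (B, T, codebook_size) next-token logits

def prior_loss(prior, ids):
    """Next-token cross-entropy: positions 0..T-2 predict tokens 1..T-1."""
    logits = prior(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
```

A joint training step would then minimize something like `recon_loss + lambda_prior * prior_loss(prior, ids)`, so that the discrete latent space is shaped simultaneously for reconstruction and for downstream AR generation, consistent with the abstract's description of the prior's role.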