VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Abstract

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner, where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class-conditional and text-to-video generation tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B parameters). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
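As a rough illustration of the token-budget idea described in the abstract, the sketch below treats a coarse-to-fine token sequence as a plain Python list and adapts it to a budget by keeping a prefix. The function name `truncate_to_budget` and the example budget of 64 are hypothetical and chosen only for illustration; the actual tokenizer relies on a learned encoder and a generative flow decoder, neither of which is modeled here. The only figures taken from the abstract are the 672 tokens and the 8x reduction.

```python
def truncate_to_budget(tokens, budget):
    """Keep the first `budget` tokens of a coarse-to-fine ordered sequence.

    Assumption: because early tokens carry the abstract content and later
    tokens add detail, shortening the sequence means dropping the tail.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return tokens[:budget]


# Figures stated in the abstract.
FLEX_TOKENS = 672      # tokens used for a 10-second, 81-frame video
REDUCTION_FACTOR = 8   # "8x fewer than a comparable 3D grid tokenizer"

# Implied token count of the comparable 3D grid tokenizer.
grid_tokens = FLEX_TOKENS * REDUCTION_FACTOR

# Stand-in token ids; a real tokenizer would produce learned codes.
tokens = list(range(FLEX_TOKENS))

# Hypothetical smaller budget for a task that needs only coarse content.
coarse = truncate_to_budget(tokens, 64)

print(grid_tokens, len(coarse))
```

Under these assumptions, the comparable 3D grid representation would need 672 × 8 = 5376 tokens for the same clip, while a downstream model with a tighter budget could consume only the leading tokens.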
