GIVT: Generative Infinite-Vocabulary Transformers

Abstract

We introduce generative infinite-vocabulary transformers (GIVT), which generate vector sequences with real-valued entries instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE. When applied to class-conditional image generation with iterative masked modeling, GIVT achieves results competitive with MaskGIT, and when used for causal modeling it outperforms both VQ-GAN and MaskGIT. Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.
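The two modifications can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the helper names (`linear`, `split_gmm_params`, `gmm_log_prob`, `gmm_sample`), the diagonal-covariance parameterization, the softplus used for the scales, and all dimensions are assumptions made for illustration, and the transformer blocks themselves are omitted.

```python
import math
import random

def linear(x, weight, bias):
    """Affine map; weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def softplus(v):
    """Numerically stable softplus, used to keep scales positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def split_gmm_params(raw, k, d):
    """Split a raw output of size k + 2*k*d into log-weights, means, scales."""
    assert len(raw) == k + 2 * k * d
    logits = raw[:k]
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    log_weights = [l - lse for l in logits]  # log-softmax over components
    means = [raw[k + i * d: k + (i + 1) * d] for i in range(k)]
    off = k + k * d
    scales = [[softplus(v) + 1e-6 for v in raw[off + i * d: off + (i + 1) * d]]
              for i in range(k)]
    return log_weights, means, scales

def gmm_log_prob(x, log_weights, means, scales):
    """Log-density of x under a diagonal-covariance Gaussian mixture (assumed form)."""
    comps = []
    for lw, mu, sigma in zip(log_weights, means, scales):
        lp = lw
        for xi, mi, si in zip(x, mu, sigma):
            lp += -0.5 * math.log(2 * math.pi) - math.log(si) - 0.5 * ((xi - mi) / si) ** 2
        comps.append(lp)
    m = max(comps)
    return m + math.log(sum(math.exp(c - m) for c in comps))

def gmm_sample(log_weights, means, scales, rng):
    """Pick a component by its weight, then sample each dimension independently."""
    probs = [math.exp(lw) for lw in log_weights]
    i = rng.choices(range(len(probs)), weights=probs)[0]
    return [rng.gauss(mu, s) for mu, s in zip(means[i], scales[i])]

# Hypothetical sizes: latent dim d, model width d_model, k mixture components.
rng = random.Random(0)
d, d_model, k = 4, 8, 3

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

w_in, b_in = rand_matrix(d_model, d), [0.0] * d_model
w_out, b_out = rand_matrix(k + 2 * k * d, d_model), [0.0] * (k + 2 * k * d)

z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # one real-valued latent vector
h = linear(z, w_in, b_in)                    # 1) projection replaces the embedding lookup
# (transformer blocks would process h here; omitted in this sketch)
log_w, means, scales = split_gmm_params(linear(h, w_out, b_out), k, d)  # 2) GMM head
nll = -gmm_log_prob(z, log_w, means, scales)  # a natural training loss for such a head
sample = gmm_sample(log_w, means, scales, rng)
```

Under these assumptions the output layer has width `k + 2*k*d` instead of the vocabulary size, and sampling a discrete token is replaced by drawing a real-valued vector from the predicted mixture.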