GIVT: Generative Infinite-Vocabulary Transformers
December 4, 2023
Authors: Michael Tschannen, Cian Eastwood, Fabian Mentzer
cs.AI
Abstract
We introduce generative infinite-vocabulary transformers (GIVT) which
generate vector sequences with real-valued entries, instead of discrete tokens
from a finite vocabulary. To this end, we propose two surprisingly simple
modifications to decoder-only transformers: 1) at the input, we replace the
finite-vocabulary lookup table with a linear projection of the input vectors;
and 2) at the output, we replace the logits prediction (usually mapped to a
categorical distribution) with the parameters of a multivariate Gaussian
mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT,
where transformers are used to model the discrete latent sequences of a VQ-VAE,
we use GIVT to model the unquantized real-valued latent sequences of a VAE.
When applying GIVT to class-conditional image generation with iterative masked
modeling, we show results competitive with MaskGIT, while our approach
outperforms both VQ-GAN and MaskGIT when used for causal modeling. Finally,
we obtain competitive results outside of image generation when applying our
approach to panoptic segmentation and depth estimation with a VAE-based variant
of the UViM framework.
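
The two modifications described above can be sketched in a few lines. The following is a minimal, framework-free illustration (NumPy, random weights, toy dimensions `d_latent`, `d_model`, `k_mix` are all assumptions, not values from the paper): the finite-vocabulary embedding lookup is replaced by a linear projection of real-valued latent vectors, and the output logits layer is replaced by a head predicting mixture logits, means, and (diagonal) log-scales of a multivariate Gaussian mixture, from which the next latent vector can be sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

d_latent = 4   # dimension of the real-valued VAE latents (assumed toy value)
d_model = 8    # transformer width (assumed toy value)
k_mix = 3      # number of Gaussian mixture components (assumed toy value)

# 1) Input: instead of an embedding lookup over a finite vocabulary,
#    linearly project each real-valued latent vector into the model width.
W_in = rng.normal(size=(d_latent, d_model)) * 0.02

def embed(latents):          # latents: (seq_len, d_latent)
    return latents @ W_in    # (seq_len, d_model)

# 2) Output: instead of vocabulary logits, predict the parameters of a
#    k-component Gaussian mixture with diagonal covariance. Per position:
#    k mixture logits, k means, and k log-scales (each of size d_latent).
n_params = k_mix * (1 + 2 * d_latent)
W_out = rng.normal(size=(d_model, n_params)) * 0.02

def gmm_head(h):             # h: (seq_len, d_model)
    p = h @ W_out
    logits = p[:, :k_mix]
    means = p[:, k_mix:k_mix + k_mix * d_latent].reshape(-1, k_mix, d_latent)
    log_scales = p[:, k_mix + k_mix * d_latent:].reshape(-1, k_mix, d_latent)
    return logits, means, log_scales

def sample_next(h_last):
    # Sample one real-valued latent vector from the predicted mixture:
    # pick a component from the softmax over mixture logits, then draw
    # from that component's diagonal Gaussian.
    logits, means, log_scales = gmm_head(h_last[None])
    probs = np.exp(logits[0] - logits[0].max())
    probs /= probs.sum()
    c = rng.choice(k_mix, p=probs)
    return means[0, c] + np.exp(log_scales[0, c]) * rng.normal(size=d_latent)

seq = rng.normal(size=(5, d_latent))  # a toy unquantized latent sequence
h = embed(seq)                        # stand-in for transformer hidden states
x_next = sample_next(h[-1])           # next latent vector, shape (d_latent,)
```

In an actual GIVT decoder, `embed` and `gmm_head` would be learned layers at either end of a standard decoder-only transformer; everything in between is unchanged.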