GIVT: Generative Infinite-Vocabulary Transformers
December 4, 2023
Authors: Michael Tschannen, Cian Eastwood, Fabian Mentzer
cs.AI
Abstract
We introduce generative infinite-vocabulary transformers (GIVT) which
generate vector sequences with real-valued entries, instead of discrete tokens
from a finite vocabulary. To this end, we propose two surprisingly simple
modifications to decoder-only transformers: 1) at the input, we replace the
finite-vocabulary lookup table with a linear projection of the input vectors;
and 2) at the output, we replace the logits prediction (usually mapped to a
categorical distribution) with the parameters of a multivariate Gaussian
mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT,
where transformers are used to model the discrete latent sequences of a VQ-VAE,
we use GIVT to model the unquantized real-valued latent sequences of a VAE.
When applying GIVT to class-conditional image generation with iterative masked
modeling, we show results competitive with MaskGIT, while our approach
outperforms both VQ-GAN and MaskGIT when used for causal modeling. Finally,
we obtain competitive results outside of image generation when applying our
approach to panoptic segmentation and depth estimation with a VAE-based variant
of the UViM framework.
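
The two modifications described above can be sketched in a few lines. The following is a minimal, framework-free illustration (NumPy, random weights, toy dimensions `d_latent`, `d_model`, `k_mix` are all assumptions, not values from the paper): the finite-vocabulary embedding lookup is replaced by a linear projection of real-valued latent vectors, and the output logits layer is replaced by a head predicting mixture logits, means, and (diagonal) log-scales of a multivariate Gaussian mixture, from which the next latent vector can be sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

d_latent = 4   # dimension of the real-valued VAE latents (assumed toy value)
d_model = 8    # transformer width (assumed toy value)
k_mix = 3      # number of Gaussian mixture components (assumed toy value)

# 1) Input: instead of an embedding lookup over a finite vocabulary,
#    linearly project each real-valued latent vector into the model width.
W_in = rng.normal(size=(d_latent, d_model)) * 0.02

def embed(latents):          # latents: (seq_len, d_latent)
    return latents @ W_in    # (seq_len, d_model)

# 2) Output: instead of vocabulary logits, predict the parameters of a
#    k-component Gaussian mixture with diagonal covariance. Per position:
#    k mixture logits, k means, and k log-scales (each of size d_latent).
n_params = k_mix * (1 + 2 * d_latent)
W_out = rng.normal(size=(d_model, n_params)) * 0.02

def gmm_head(h):             # h: (seq_len, d_model)
    p = h @ W_out
    logits = p[:, :k_mix]
    means = p[:, k_mix:k_mix + k_mix * d_latent].reshape(-1, k_mix, d_latent)
    log_scales = p[:, k_mix + k_mix * d_latent:].reshape(-1, k_mix, d_latent)
    return logits, means, log_scales

def sample_next(h_last):
    # Sample one real-valued latent vector from the predicted mixture:
    # pick a component from the softmax over mixture logits, then draw
    # from that component's diagonal Gaussian.
    logits, means, log_scales = gmm_head(h_last[None])
    probs = np.exp(logits[0] - logits[0].max())
    probs /= probs.sum()
    c = rng.choice(k_mix, p=probs)
    return means[0, c] + np.exp(log_scales[0, c]) * rng.normal(size=d_latent)

seq = rng.normal(size=(5, d_latent))  # a toy unquantized latent sequence
h = embed(seq)                        # stand-in for transformer hidden states
x_next = sample_next(h[-1])           # next latent vector, shape (d_latent,)
```

In an actual GIVT decoder, `embed` and `gmm_head` would be learned layers at either end of a standard decoder-only transformer; everything in between is unchanged.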