

GIVT: Generative Infinite-Vocabulary Transformers

December 4, 2023
Authors: Michael Tschannen, Cian Eastwood, Fabian Mentzer
cs.AI

Abstract

We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE. When applying GIVT to class-conditional image generation with iterative masked modeling, we show competitive results with MaskGIT, while our approach outperforms both VQ-GAN and MaskGIT when using it for causal modeling. Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.
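
The two modifications are simple enough to sketch directly. Below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; all module, dimension, and parameter names are invented here, and a diagonal covariance per mixture component is assumed for simplicity) of the input and output changes: a linear projection replaces the finite-vocabulary embedding lookup, and the output head predicts Gaussian-mixture parameters instead of vocabulary logits.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class GIVTInputOutput(nn.Module):
    """Sketch of GIVT's I/O changes to a decoder-only transformer:
    a linear input projection instead of an embedding table, and a
    GMM parameter head instead of categorical logits."""

    def __init__(self, d_latent: int, d_model: int, num_mixtures: int):
        super().__init__()
        self.d_latent = d_latent
        self.num_mixtures = num_mixtures
        # 1) Input: linearly project real-valued latent vectors,
        #    replacing the finite-vocabulary lookup table.
        self.input_proj = nn.Linear(d_latent, d_model)
        # 2) Output: predict mixture weights, means, and log-scales
        #    for a GMM over the next latent vector.
        self.out_proj = nn.Linear(d_model, num_mixtures * (1 + 2 * d_latent))

    def embed(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_latent), real-valued VAE latents.
        return self.input_proj(x)

    def distribution(self, h: torch.Tensor) -> D.MixtureSameFamily:
        # h: (batch, seq, d_model), transformer output states.
        k, d = self.num_mixtures, self.d_latent
        logits, means, log_scales = self.out_proj(h).split(
            [k, k * d, k * d], dim=-1)
        means = means.unflatten(-1, (k, d))
        scales = log_scales.unflatten(-1, (k, d)).exp()
        # Mixture of k diagonal-covariance Gaussians per position.
        components = D.Independent(D.Normal(means, scales), 1)
        return D.MixtureSameFamily(D.Categorical(logits=logits), components)
```

Training would then minimize the negative log-likelihood of the next latent vector, e.g. `-model.distribution(h).log_prob(target).mean()`, and generation draws real-valued vectors from the predicted mixture via `dist.sample()` rather than picking a token from a finite vocabulary.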