GIVT: Generative Infinite-Vocabulary Transformers

Abstract

We introduce generative infinite-vocabulary transformers (GIVT), which generate sequences of vectors with real-valued entries instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE. When applying GIVT to class-conditional image generation with iterative masked modeling, we obtain results competitive with MaskGIT, while our approach outperforms both VQ-GAN and MaskGIT when used for causal modeling. Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.
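The two modifications can be illustrated with a minimal sketch in plain Python (standard library only). It is not the authors' implementation: the diagonal covariance, the softplus parameterization of the scales, and all function and variable names (`linear`, `gmm_head`, `gmm_nll`, `gmm_sample`) are assumptions made for illustration, and the transformer blocks between input and output are omitted.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map; weight is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(v):
    # Numerically stable log(1 + exp(v)); keeps predicted scales positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def gmm_head(raw, k, d):
    """Split k + 2*k*d raw outputs into mixture weights, means, and scales.

    Assumption: each of the k components has a diagonal covariance.
    """
    assert len(raw) == k + 2 * k * d
    weights = softmax(raw[:k])
    means = [raw[k + i * d: k + (i + 1) * d] for i in range(k)]
    off = k + k * d
    scales = [[softplus(v) + 1e-6 for v in raw[off + i * d: off + (i + 1) * d]]
              for i in range(k)]
    return weights, means, scales


def gmm_nll(target, weights, means, scales):
    """Negative log-likelihood of one real-valued token under the mixture."""
    log_terms = []
    for w, mu, sigma in zip(weights, means, scales):
        log_prob = sum(
            -0.5 * ((t - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
            for t, m, s in zip(target, mu, sigma)
        )
        log_w = math.log(w) if w > 0.0 else float("-inf")
        log_terms.append(log_w + log_prob)
    top = max(log_terms)  # log-sum-exp over components
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))


def gmm_sample(weights, means, scales, rng):
    """Pick a component by its weight, then draw from its Gaussian."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, s) for m, s in zip(means[i], scales[i])]


# Toy usage with hypothetical sizes and random, untrained parameters.
rng = random.Random(0)
d_latent, d_model, k = 4, 8, 2


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


w_in, b_in = rand_matrix(d_model, d_latent), [0.0] * d_model
n_out = k + 2 * k * d_latent
w_out, b_out = rand_matrix(n_out, d_model), [0.0] * n_out

z = [rng.gauss(0.0, 1.0) for _ in range(d_latent)]  # one unquantized latent vector
h = linear(z, w_in, b_in)  # 1) linear projection replaces the embedding lookup
# (transformer blocks would transform h here)
weights, means, scales = gmm_head(linear(h, w_out, b_out), k, d_latent)  # 2) GMM head
loss = gmm_nll(z, weights, means, scales)  # in training, the target is the next latent
sample = gmm_sample(weights, means, scales, rng)
```

In this sketch the negative log-likelihood plays the role that cross-entropy over a finite vocabulary plays in a standard decoder-only transformer, and sampling a real-valued vector from the mixture replaces sampling a token index.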
