Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation

Abstract

This paper introduces a novel approach to topic modeling that uses the latent codebooks of a Vector-Quantized Variational Auto-Encoder (VQ-VAE), which discretely encapsulate the rich information in pre-trained embeddings such as those from a pre-trained language model. Interpreting the latent codebooks and embeddings as a conceptual bag-of-words (BoW), we propose a new generative topic model, Topic-VQ-VAE (TVQ-VAE), which inversely generates the original documents related to the respective latent codebook. TVQ-VAE can visualize topics with various generative distributions, including the traditional BoW distribution and autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context, which reveals the underlying structures of the dataset, and supports flexible forms of document generation. The official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.
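The "conceptual bag-of-words" reading of a codebook can be illustrated with a small sketch. This is not the authors' implementation: it only shows the generic vector-quantization step (assigning each embedding to its nearest codebook vector) followed by counting the selected indices per document, which gives a BoW-like vector over codebook entries. The function names, the toy codebook, and the toy embeddings are all hypothetical, and the topic model that would sit on top of these counts is not shown.

```python
from collections import Counter
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook vector nearest to the embedding."""
    return min(
        range(len(codebook)),
        key=lambda k: squared_distance(embedding, codebook[k]),
    )


def conceptual_bow(doc_embeddings: List[Vector], codebook: List[Vector]) -> List[int]:
    """Count how often each codebook index is selected within one document."""
    counts = Counter(quantize(e, codebook) for e in doc_embeddings)
    return [counts.get(k, 0) for k in range(len(codebook))]


# Toy data (hypothetical): a 3-entry codebook in 2-D and one document
# represented by 4 embeddings. A real codebook would be learned by a VQ-VAE.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
document = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

bow = conceptual_bow(document, codebook)
print(bow)  # prints [1, 2, 1]
```

In this toy setting, the count vector plays the role that word counts play in a classical topic model, with codebook indices standing in for vocabulary entries.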

