Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation

Abstract

This paper introduces a novel approach to topic modeling that uses the latent codebooks of a Vector-Quantized Variational Auto-Encoder (VQ-VAE), which discretely encapsulate the rich information in pre-trained embeddings such as those from a pre-trained language model. Building on a novel interpretation of the latent codebooks and embeddings as a conceptual bag-of-words (BoW), we propose a new generative topic model called Topic-VQ-VAE (TVQ-VAE), which inversely generates the original documents related to the respective latent codebook. TVQ-VAE can visualize topics with various generative distributions, including the traditional BoW distribution and autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context, which reveals the underlying structures of the dataset, and supports flexible forms of document generation. The official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.
