Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation

Abstract

This paper introduces a novel approach to topic modeling that uses the latent codebooks of a Vector-Quantized Variational Auto-Encoder (VQ-VAE), which discretely encapsulate the rich information in pre-trained embeddings such as those of a pre-trained language model. Building on a novel interpretation of the latent codebooks and embeddings as a conceptual bag-of-words, we propose a new generative topic model called Topic-VQ-VAE (TVQ-VAE), which inversely generates the original documents related to the respective latent codebook. TVQ-VAE can visualize topics with various generative distributions, including the traditional bag-of-words (BoW) distribution and autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures topic context that reveals the underlying structure of the dataset and supports flexible forms of document generation. The official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.
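To make the "conceptual bag-of-words" reading concrete, the following is a minimal sketch of the general vector-quantization idea only, not the authors' implementation: each embedding is mapped to its nearest codebook entry, and a document is summarized by the counts of the selected code indices. The function names, the toy codebook, and the toy embeddings are all hypothetical and chosen for illustration.

```python
from collections import Counter


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance (the usual VQ lookup rule)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def conceptual_bow(embeddings, codebook):
    """Quantize every embedding and count how often each code index occurs.
    The resulting counts play the role of a bag-of-words over codes."""
    return Counter(nearest_code(v, codebook) for v in embeddings)


# Toy data (assumed for illustration): a 3-entry codebook in 2-D
# and four embeddings standing in for the words of one document.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
document = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (-0.8, 0.9)]

bow = conceptual_bow(document, codebook)
print(dict(bow))
```

In this toy setting the count vector over code indices is what a topic model would consume in place of ordinary word counts; how TVQ-VAE actually builds and uses that representation is described in the paper and the linked repository.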
