Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation
December 15, 2023
Authors: YoungJoon Yoo, Jongwon Choi
cs.AI
Abstract
This paper introduces a novel approach to topic modeling that uses latent
codebooks from a Vector-Quantized Variational Auto-Encoder (VQ-VAE), which
discretely encapsulate the rich information in pre-trained embeddings such as
those of a pre-trained language model. Building on a novel interpretation of
the latent codebooks and embeddings as a conceptual bag-of-words (BoW), we
propose a new generative topic model called Topic-VQ-VAE (TVQ-VAE), which
inversely generates the original documents related to the respective latent
codebook. TVQ-VAE can visualize topics with various generative distributions,
including the traditional BoW distribution and autoregressive image
generation. Our experimental results on document analysis and image generation
demonstrate that TVQ-VAE effectively captures the topic context, which reveals
the underlying structure of the dataset, and supports flexible forms of
document generation. The official implementation of the proposed TVQ-VAE is
available at https://github.com/clovaai/TVQ-VAE.
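
The idea of treating codebook entries as a conceptual bag-of-words can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nearest_code` and `conceptual_bow`, the toy 2-D vectors, and the use of squared Euclidean distance for quantization are all assumptions made for illustration. Each embedding is mapped to its nearest codebook entry, and a document is then summarized by how often each code index occurs.

```python
from collections import Counter


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`.

    Closeness is measured by squared Euclidean distance (an assumption
    for this sketch; it is the usual VQ-VAE quantization rule).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def conceptual_bow(embeddings, codebook):
    """Quantize each embedding and count occurrences of each code index."""
    counts = Counter(nearest_code(e, codebook) for e in embeddings)
    return [counts.get(k, 0) for k in range(len(codebook))]


# Hypothetical toy data: three 2-D codebook vectors and the embeddings
# of four words from a single document.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
embeddings = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]]

bow = conceptual_bow(embeddings, codebook)
print(bow)  # → [1, 2, 1]
```

The resulting count vector plays the role that word counts play in a classical topic model, with code indices standing in for vocabulary entries.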