Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation

Abstract

This paper introduces a novel approach to topic modeling that uses the latent codebooks of a Vector-Quantized Variational Auto-Encoder (VQ-VAE), which discretely encapsulate the rich information of pre-trained embeddings, such as those from a pre-trained language model. Building on a novel interpretation of the latent codebooks and embeddings as a conceptual bag-of-words (BoW), we propose a new generative topic model, Topic-VQ-VAE (TVQ-VAE), which inversely generates the original documents related to each latent codebook. TVQ-VAE can visualize topics with various generative distributions, including the traditional BoW distribution and autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures topic context, revealing the underlying structure of the dataset and supporting flexible forms of document generation. The official implementation of TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.
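
To make the "conceptual bag-of-words" reading concrete, the following is a minimal sketch, not the official TVQ-VAE implementation. It assumes each token embedding is assigned to its nearest codebook vector under squared Euclidean distance (the standard VQ-VAE quantization rule) and that a document is then summarized by how often each codebook index occurs. The function names nearest_code and conceptual_bow, and the toy vectors, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def nearest_code(vec, codebook):
    """Return the index of the codebook vector closest to vec
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def conceptual_bow(embeddings, codebook):
    """Quantize each embedding and count occurrences of each codebook index,
    giving a bag-of-words vector over codebook entries instead of raw words."""
    counts = Counter(nearest_code(v, codebook) for v in embeddings)
    return [counts.get(k, 0) for k in range(len(codebook))]


# Toy data (hypothetical): three 2-D codebook vectors, four token embeddings.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
doc_embeddings = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (-0.8, 0.9)]

bow = conceptual_bow(doc_embeddings, codebook)
print(bow)  # [1, 2, 1]
```

In this sketch the count vector plays the role that word counts play in a classical topic model. How the paper actually builds topics and generates documents from such counts is not specified in the abstract and is not shown here.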
