Image Tokenizer Needs Post-Training

Abstract

Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers prioritize only the reconstruction task, which takes place before generative training, and do not account for the generation errors that arise during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme comprising main training and post-training, which focus on improving latent space construction and decoding, respectively. During main training, a latent perturbation strategy is proposed to simulate sampling noise, i.e., the unexpected tokens generated during generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer and thereby boosts generation quality and convergence speed, and a novel tokenizer evaluation metric, pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID, which improves further to 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
