LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

Abstract

Recent advances in text-to-image generation have relied primarily on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only 0.7B parameters. Using a compact synthetic dataset of just 2M high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically lowers computational demands and shortens pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate DPO to refine image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at https://github.com/XianfengWu01/LightGen
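The abstract names DPO but does not state its objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair, as it is usually written for language models: the negative log-sigmoid of a scaled difference between the policy-to-reference log-ratios of the preferred and rejected samples. The function name, argument names, and the default `beta` are hypothetical. The abstract does not say how LightGen derives these log-likelihood terms from a MAR model, so they are treated here as plain inputs.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All inputs are log-probabilities (floats). This is the generic
    formulation, not necessarily the exact variant used by LightGen.
    """
    # How much more the policy favors each sample than the frozen reference does.
    win_ratio = policy_logp_win - ref_logp_win
    lose_ratio = policy_logp_lose - ref_logp_lose

    # Scaled preference margin; larger means the policy prefers the winner more.
    z = beta * (win_ratio - lose_ratio)

    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference, the margin is zero and the loss is log 2. Training pushes the loss below that value by raising the relative likelihood of the preferred image over the rejected one.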
