Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Abstract

We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique preserves data fidelity better by using tokens at greater quantization depth. However, increasing the number of tokens in a generative model slows inference. To address this, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is to directly predict the vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at https://resgen-genai.github.io
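To make the notion of token depth concrete, here is a minimal sketch of generic residual vector quantization in plain Python. It is not ResGen's model or its prediction procedure: the codebooks are random and hypothetical, the helper names (`rvq_encode`, `rvq_decode`, `make_codebooks`) are invented for illustration, and index 0 of every codebook is assumed to be the zero vector so that adding a depth can never increase the reconstruction error. The summed codewords returned by `rvq_decode` are one plausible reading of the "vector embedding of collective tokens" mentioned in the abstract.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Greedy RVQ: emit one token per depth, each quantizing the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected codewords over depths to get one embedding for the token stack."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(depth, size, dim, rng):
    """Hypothetical random codebooks with shrinking scale; index 0 is the zero vector."""
    books = []
    for d in range(depth):
        scale = 0.5 ** d
        book = [[0.0] * dim]
        book += [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(size - 1)]
        books.append(book)
    return books


rng = random.Random(0)
codebooks = make_codebooks(depth=4, size=16, dim=4, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]
tokens = rvq_encode(x, codebooks)

# Squared reconstruction error when only the first d depths are used.
errors = []
for d in range(1, len(codebooks) + 1):
    approx = rvq_decode(tokens[:d], codebooks[:d])
    errors.append(sum((a - b) ** 2 for a, b in zip(approx, x)))

print(tokens)
print(errors)
```

Each input vector costs `depth` tokens, and the error can only shrink or stay flat as depth grows. That is the fidelity-versus-token-count trade-off the abstract describes: a model that predicts every token individually pays for each extra depth, whereas predicting the summed embedding handles the whole stack at once.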
