Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Abstract

There is growing interest in integrating high-fidelity visual synthesis into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs, or that bridge LLMs and diffusion models, usually incur high training costs because the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model through a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of the MLLM, we equip it with a visual generation branch, initialized from the original MLLM parameters, for predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity, controllable image generation while training far more efficiently. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.
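As a rough illustration of the branch initialization mentioned in the abstract, the following minimal sketch copies a set of backbone parameters into a separate generation branch and updates only the copy. It is not the authors' implementation: the parameter names, `init_generation_branch`, and `sgd_step` are hypothetical, plain Python lists stand in for tensors, and it assumes the original parameters are simply left untouched during training.

```python
import copy


def init_generation_branch(backbone_params):
    """Return an independent copy of the backbone parameters (hypothetical helper)."""
    # A deep copy ensures later updates to the branch never alias the backbone.
    return copy.deepcopy(backbone_params)


def sgd_step(params, grads, lr=0.1):
    """Apply one plain gradient-descent update, in place, to the named parameters."""
    for name, grad in grads.items():
        params[name] = [w - lr * g for w, g in zip(params[name], grad)]


# Toy stand-in for pretrained MLLM weights (names are illustrative only).
backbone = {"attn.qkv": [0.5, -0.2], "mlp.fc": [1.0, 0.3]}

# The generation branch starts from the same values as the backbone.
branch = init_generation_branch(backbone)
starts_identical = branch == backbone

# Only the branch receives updates; the backbone keeps its original values.
sgd_step(branch, {"attn.qkv": [1.0, 1.0]})
```

Because the branch starts from the pretrained values yet is stored separately, training it for image-embedding prediction in this sketch cannot alter the weights used for multimodal understanding.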
