Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
August 8, 2025
Authors: Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
cs.AI
Abstract
There is growing interest in integrating high-fidelity visual synthesis
capabilities into large language models (LLMs) without compromising their
strong reasoning capabilities. Existing methods that directly train LLMs or
bridge LLMs and diffusion models usually suffer from costly training since the
backbone LLMs have not seen image representations during pretraining. We
present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs
(MLLMs) and diffusion models using patch-level CLIP image embeddings as latent
variables, which are natively aligned with the MLLM's CLIP visual encoder.
These patch-level image embeddings are integrated into the diffusion model with
a lightweight adaptation of its ControlNet. To retain the original multimodal
reasoning capabilities of MLLMs, we equip the MLLM with a visual generation
branch initialized from the original MLLM parameters when predicting the
patch-level image embeddings. By seamlessly integrating pretrained MLLMs and
diffusion models with patch-level CLIP latents, our framework enables
high-fidelity controllable image generation with significant training
efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or
better performance than previous methods in terms of visual fidelity and
multimodal understanding, with substantially lower compute during training. We
also provide comprehensive ablation studies showing the effectiveness of our
design choices.
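The pipeline described above can be sketched in terms of tensor shapes: an MLLM visual-generation branch predicts patch-level CLIP image embeddings, and a lightweight ControlNet-style adapter projects them into the diffusion model's conditioning space. This is a minimal illustrative sketch; the dimensions, function names, and random linear maps are hypothetical stand-ins, not the paper's actual architecture or weights.

```python
import numpy as np

# Hypothetical dimensions for illustration only (not taken from the paper).
NUM_PATCHES = 256   # e.g. a 16x16 grid of image patches
CLIP_DIM = 1024     # width of a patch-level CLIP embedding
DIFF_DIM = 768      # diffusion model's conditioning width

rng = np.random.default_rng(0)

def mllm_visual_generation_branch(prompt_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the MLLM branch that predicts patch-level CLIP latents.
    In Bifrost-1 this branch is initialized from the original MLLM
    parameters; here it is a random linear map, purely to show shapes."""
    w = rng.standard_normal((prompt_tokens.shape[-1], NUM_PATCHES * CLIP_DIM))
    pooled = prompt_tokens.mean(axis=0)          # pool text tokens
    return (pooled @ w).reshape(NUM_PATCHES, CLIP_DIM)

def controlnet_adapter(clip_latents: np.ndarray) -> np.ndarray:
    """Stand-in for the lightweight ControlNet adaptation that maps
    patch-level CLIP latents into the diffusion conditioning space."""
    w = rng.standard_normal((CLIP_DIM, DIFF_DIM))
    return clip_latents @ w

prompt = rng.standard_normal((12, 512))          # 12 text tokens, width 512
clip_latents = mllm_visual_generation_branch(prompt)
cond = controlnet_adapter(clip_latents)
print(clip_latents.shape, cond.shape)            # (256, 1024) (256, 768)
```

The key point the sketch conveys is that the diffusion model never consumes the MLLM's text states directly: it is conditioned only on the predicted patch-level CLIP latents, which are natively aligned with the MLLM's own CLIP visual encoder.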