
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

August 8, 2025
Authors: Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
cs.AI

Abstract

There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.
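To make the data flow described in the abstract more concrete, below is a minimal architectural sketch. It is an illustrative assumption based only on the abstract, not the authors' implementation: a visual generation branch (standing in for the MLLM branch initialized from the original MLLM weights) predicts patch-level CLIP image embeddings, and a lightweight ControlNet-style adapter projects those latents into residual features for a frozen diffusion backbone. All module names, dimensions, and injection points are hypothetical.

```python
# Hedged sketch of the Bifrost-1 data flow as described in the abstract.
# All modules, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class VisualGenerationBranch(nn.Module):
    """Placeholder for the MLLM's visual generation branch that predicts
    patch-level CLIP image embeddings (initialized from MLLM weights in the paper)."""

    def __init__(self, hidden_dim: int = 1024, num_patches: int = 256):
        super().__init__()
        self.num_patches = num_patches
        # Hypothetical: a small transformer head mapping MLLM hidden states
        # at image-token positions to patch-level CLIP latents.
        self.head = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.to_clip = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (B, num_patches, hidden_dim) stand-in for MLLM hidden states.
        return self.to_clip(self.head(mllm_hidden))  # (B, num_patches, hidden_dim)


class ControlNetStyleAdapter(nn.Module):
    """Placeholder for the lightweight ControlNet adaptation that injects the
    predicted CLIP latents into the frozen diffusion model."""

    def __init__(self, clip_dim: int = 1024, unet_dim: int = 320):
        super().__init__()
        self.proj = nn.Linear(clip_dim, unet_dim)
        # Zero-initialized output, mirroring the common ControlNet trick so the
        # frozen diffusion model is initially unaffected by the new conditioning.
        self.zero_out = nn.Linear(unet_dim, unet_dim)
        nn.init.zeros_(self.zero_out.weight)
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, clip_latents: torch.Tensor) -> torch.Tensor:
        # Returns residual features to add to the diffusion model's activations
        # (the actual injection points are not specified in the abstract).
        return self.zero_out(self.proj(clip_latents))


if __name__ == "__main__":
    branch = VisualGenerationBranch()
    adapter = ControlNetStyleAdapter()
    fake_mllm_hidden = torch.randn(2, 256, 1024)        # placeholder MLLM states
    clip_latents = branch(fake_mllm_hidden)             # patch-level CLIP latents
    control_residual = adapter(clip_latents)            # conditioning for the diffusion model
    print(clip_latents.shape, control_residual.shape)
```

The zero-initialized adapter output is a standard ControlNet-style choice (assumed here, not confirmed by the abstract): it lets training start from the unchanged pretrained diffusion model and gradually learn to use the patch-level CLIP conditioning.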