Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Abstract

There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.
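
The branch-initialization idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyMLLM`, its single linear layer, the per-token routing flag, and the choice to leave the pretrained weights untouched while only the copied branch is updated are all assumptions made for illustration. The sketch only shows how a generation branch that starts as a copy of the pretrained parameters initially behaves identically to the original model, and how updating it leaves the original path unchanged.

```python
import copy
import random


def make_linear(dim_in, dim_out, seed):
    """Build a small random weight matrix (a stand-in for pretrained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]


def apply_linear(weights, vec):
    """Matrix-vector product using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyMLLM:
    """Hypothetical one-layer stand-in for a pretrained MLLM."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        # Assumed: these weights are pretrained and are never updated here.
        self.understanding = make_linear(dim, dim, seed)
        self.generation = None

    def add_generation_branch(self):
        # Initialize the new branch as an independent copy of the pretrained weights.
        self.generation = copy.deepcopy(self.understanding)

    def forward(self, tokens, is_image_token):
        # Assumed routing: image-patch positions use the generation branch,
        # all other positions use the original (understanding) weights.
        outputs = []
        for vec, is_img in zip(tokens, is_image_token):
            weights = self.generation if is_img else self.understanding
            outputs.append(apply_linear(weights, vec))
        return outputs


def update_generation_branch(model, grad, lr=0.1):
    """Apply one gradient-descent step to the generation branch only."""
    for i, row in enumerate(model.generation):
        for j in range(len(row)):
            row[j] -= lr * grad[i][j]


model = ToyMLLM(dim=4)
model.add_generation_branch()
token = [1.0, 0.5, -0.5, 2.0]

# Before any update, both branches give the same output for the same input.
text_out, image_out = model.forward([token, token], [False, True])
same_at_init = text_out == image_out

# After updating only the generation branch, the original path is unchanged.
frozen_before = copy.deepcopy(model.understanding)
update_generation_branch(model, grad=[[1.0] * 4 for _ in range(4)])
text_after, image_after = model.forward([token, token], [False, True])
```

In this toy setting, `same_at_init` is true because the copied branch starts from identical weights, while `text_after` matches `text_out` and `image_after` does not, since only the copy was modified.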
