UniFusion: Vision-Language Model as Unified Encoder in Image Generation
October 14, 2025
Authors: Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale
cs.AI
Abstract
Although recent advances in visual generation have been remarkable, most
existing architectures still depend on distinct encoders for images and text.
This separation constrains diffusion models' ability to perform cross-modal
reasoning and knowledge transfer. Prior attempts to bridge this gap often use
last-layer information from a VLM, employ multiple visual encoders, or train
large unified models jointly for text and image generation, which demands
substantial computational resources and large-scale data, limiting their
accessibility. We present UniFusion, a diffusion-based generative model
conditioned on a frozen large vision-language model (VLM) that serves as a
unified multimodal encoder. At the core of UniFusion is the Layerwise Attention
Pooling (LAP) mechanism that extracts both high-level semantics and low-level
details from text and visual tokens of a frozen VLM to condition a diffusion
generative model. We demonstrate that LAP outperforms other shallow fusion
architectures on text-image alignment for generation and on faithful transfer of
visual information from the VLM to the diffusion model, which is key for editing. We
propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI),
which conditions a diffusion transformer (DiT) only on the text tokens
generated by the VLM during in-model prompt rewriting. VERIFI combines the
alignment of the conditioning distribution with the VLM's reasoning
capabilities, increasing capability and flexibility at inference. In
addition, finetuning on the editing task not only improves text-image alignment for
generation, indicative of cross-modal knowledge transfer, but also exhibits
strong generalization capabilities. Our model, when trained on single-image
editing, generalizes zero-shot to multiple image references, further motivating
the unified encoder design of UniFusion.
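
For readers who want a concrete picture of the Layerwise Attention Pooling idea described in the abstract, a minimal sketch follows. It assumes a PyTorch setup in which the per-layer hidden states of the frozen VLM (covering both text and visual tokens) are available as a list of tensors; the class name, shapes, and projections are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of attention pooling across the layers of a frozen VLM.
# Assumed (hypothetical) interface: hidden_states is a list with one
# [batch, seq, vlm_dim] tensor per VLM layer; the output is a
# [batch, seq, cond_dim] conditioning sequence for the diffusion transformer.
import torch
import torch.nn as nn


class LayerwiseAttentionPool(nn.Module):
    def __init__(self, vlm_dim: int, cond_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(vlm_dim) / vlm_dim ** 0.5)  # learnable query over layers
        self.key = nn.Linear(vlm_dim, vlm_dim)    # per-layer key projection
        self.proj = nn.Linear(vlm_dim, cond_dim)  # map pooled features to the DiT conditioning width

    def forward(self, hidden_states: list) -> torch.Tensor:
        h = torch.stack(hidden_states, dim=2)             # [batch, seq, layers, vlm_dim]
        scores = (self.key(h) * self.query).sum(dim=-1)   # [batch, seq, layers]
        attn = torch.softmax(scores, dim=-1)              # per-token attention weights over layers
        pooled = (attn.unsqueeze(-1) * h).sum(dim=2)      # [batch, seq, vlm_dim]
        return self.proj(pooled)                          # [batch, seq, cond_dim]
```

In this reading, each token (text or visual) keeps its own learned mixture over VLM depths, so the diffusion model can receive high-level semantics from late layers and low-level detail from early ones.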
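The VERIFI flow can be sketched in the same spirit: the frozen VLM first rewrites the prompt in-model, and only the hidden states at the generated (rewritten) token positions are pooled into DiT conditioning. The sketch below assumes a HuggingFace-style causal VLM interface; `rewrite_instruction`, `lap`, and the slicing convention are illustrative assumptions rather than the paper's API.

```python
import torch


@torch.no_grad()  # inference-time sketch
def verifi_conditioning(vlm, tokenizer, lap, user_prompt: str,
                        rewrite_instruction: str, max_new_tokens: int = 128):
    # 1) In-model prompt rewriting with the frozen VLM (hypothetical instruction prefix).
    inputs = tokenizer(rewrite_instruction + user_prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    gen_ids = vlm.generate(**inputs, max_new_tokens=max_new_tokens)

    # 2) Re-encode prompt + rewrite and collect hidden states from every layer.
    out = vlm(input_ids=gen_ids, output_hidden_states=True)

    # 3) Keep only the rewritten-token positions: per the abstract, the DiT is
    #    conditioned solely on text tokens generated by the VLM.
    rewritten = [h[:, prompt_len:, :] for h in out.hidden_states]

    # 4) Pool across layers (e.g. with the LAP sketch above) into conditioning tokens.
    return lap(rewritten)
```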