UniFusion: Vision-Language Model as Unified Encoder in Image Generation
October 14, 2025
Authors: Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale
cs.AI
Abstract
Although recent advances in visual generation have been remarkable, most
existing architectures still depend on distinct encoders for images and text.
This separation constrains diffusion models' ability to perform cross-modal
reasoning and knowledge transfer. Prior attempts to bridge this gap often use
last-layer information from a VLM, employ multiple visual encoders, or jointly
train large unified models for text and image generation, which demands
substantial computational resources and large-scale data, limiting their
accessibility. We present UniFusion, a diffusion-based generative model
conditioned on a frozen large vision-language model (VLM) that serves as a
unified multimodal encoder. At the core of UniFusion is the Layerwise Attention
Pooling (LAP) mechanism, which extracts both high-level semantics and low-level
details from text and visual tokens of a frozen VLM to condition a diffusion
generative model. We demonstrate that LAP outperforms other shallow fusion
architectures on text-image alignment for generation and on faithful transfer
of visual information from the VLM to the diffusion model, which is key for
editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI),
which conditions a diffusion transformer (DiT) only on the text tokens
generated by the VLM during in-model prompt rewriting. VERIFI combines the
alignment of the conditioning distribution with the VLM's reasoning
capabilities, increasing capability and flexibility at inference time. In
addition, finetuning on the editing task not only improves text-image alignment
for generation, indicative of cross-modal knowledge transfer, but also exhibits
strong generalization capabilities. When trained only on single-image editing,
our model generalizes zero-shot to multiple image references, further motivating
the unified encoder design of UniFusion.
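
To make the Layerwise Attention Pooling idea concrete, the sketch below pools per-layer hidden states of a frozen VLM into a single conditioning sequence via a learned attention over layers, then projects it to the diffusion transformer's conditioning width. This is a minimal, hypothetical PyTorch illustration based only on the abstract; the module name, the per-layer bias, and the projection are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of Layerwise Attention Pooling (LAP), assuming a frozen VLM
# that exposes hidden states from every layer. All names and the exact pooling
# form are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class LayerwiseAttentionPooling(nn.Module):
    """Pool per-layer VLM hidden states into one conditioning sequence."""

    def __init__(self, hidden_dim: int, num_layers: int, cond_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim) / hidden_dim**0.5)
        self.layer_bias = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(hidden_dim, cond_dim)

    def forward(self, layer_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_states: one [batch, seq, hidden] tensor per VLM layer,
        # covering both the text and visual tokens of the frozen VLM.
        stacked = torch.stack(layer_states, dim=2)            # [B, T, L, H]
        scores = torch.einsum("btlh,h->btl", stacked, self.query)
        weights = (scores + self.layer_bias).softmax(dim=-1)  # attend over layers
        pooled = torch.einsum("btl,btlh->bth", weights, stacked)
        return self.proj(pooled)                              # [B, T, cond_dim]


# Usage sketch: hidden states from a frozen VLM (e.g. obtained with
# output_hidden_states=True in Hugging Face transformers) are pooled and then
# passed as the conditioning sequence to the diffusion transformer.
lap = LayerwiseAttentionPooling(hidden_dim=4096, num_layers=32, cond_dim=3072)
hidden_states = [torch.randn(1, 77, 4096) for _ in range(32)]
cond_tokens = lap(hidden_states)  # [1, 77, 3072]
```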