LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
October 27, 2025
Authors: Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie
cs.AI
Abstract
Unified multimodal models have recently shown remarkable gains in both
capability and versatility, yet most leading systems are still trained from
scratch and require substantial computational resources. In this paper, we show
that competitive performance can be obtained far more efficiently by
strategically fusing publicly available models specialized for either
generation or understanding. Our key design is to retain the original blocks
while additionally interleaving multimodal self-attention blocks throughout the
networks. This double fusion mechanism (1) effectively enables rich multi-modal
fusion while largely preserving the original strengths of the base models, and
(2) catalyzes synergistic fusion of high-level semantic representations from
the understanding encoder with low-level spatial signals from the generation
encoder. By training with only ~35B tokens, this approach achieves strong
results across multiple benchmarks: 0.91 on GenEval for compositional
text-to-image generation, 82.16 on DPG-Bench for complex text-to-image
generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By
fully releasing the entire suite of code, model weights, and datasets, we hope
to support future research on unified multimodal modeling.
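The core design described above, keeping each base model's original transformer blocks while interleaving new multimodal self-attention blocks that jointly attend over tokens from the understanding encoder and the generation encoder, can be illustrated with a minimal PyTorch sketch. This is an assumed, simplified reading of the abstract, not the released implementation; all module and parameter names here (MultimodalFusionBlock, DoubleFusionLayer, und_block, gen_block, d_model, n_heads) are hypothetical.

```python
import torch
import torch.nn as nn


class MultimodalFusionBlock(nn.Module):
    """Interleaved self-attention block over concatenated understanding + generation tokens."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, und_tokens: torch.Tensor, gen_tokens: torch.Tensor):
        # Joint self-attention across both token streams, with a residual connection.
        joint = torch.cat([und_tokens, gen_tokens], dim=1)
        h = self.norm(joint)
        h, _ = self.attn(h, h, h, need_weights=False)
        joint = joint + h
        # Split back so each branch's original blocks keep operating on their own stream.
        n_und = und_tokens.shape[1]
        return joint[:, :n_und], joint[:, n_und:]


class DoubleFusionLayer(nn.Module):
    """One layer: the original per-branch blocks, followed by an interleaved fusion block."""

    def __init__(self, und_block: nn.Module, gen_block: nn.Module, d_model: int, n_heads: int):
        super().__init__()
        self.und_block = und_block  # pretrained block from the understanding base model
        self.gen_block = gen_block  # pretrained block from the generation base model
        self.fusion = MultimodalFusionBlock(d_model, n_heads)

    def forward(self, und_tokens, gen_tokens):
        und_tokens = self.und_block(und_tokens)
        gen_tokens = self.gen_block(gen_tokens)
        return self.fusion(und_tokens, gen_tokens)


if __name__ == "__main__":
    # Toy usage: identity stand-ins for the "original" blocks; in the paper these
    # would be layers taken from the pretrained base models.
    d, heads = 64, 4
    layer = DoubleFusionLayer(nn.Identity(), nn.Identity(), d, heads)
    und = torch.randn(2, 16, d)  # understanding-encoder tokens (high-level semantics)
    gen = torch.randn(2, 32, d)  # generation-encoder tokens (low-level spatial signals)
    und_out, gen_out = layer(und, gen)
    print(und_out.shape, gen_out.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 32, 64])
```

In this reading, only the interleaved fusion blocks introduce new parameters, which is one plausible way the original strengths of the two base models could be preserved while still mixing their representations.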