FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
April 14, 2025
Authors: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
cs.AI
Abstract
We introduce FUSION, a family of multimodal large language models (MLLMs)
with a fully vision-language alignment and integration paradigm. Unlike
existing methods that primarily rely on late-stage modality interaction during
LLM decoding, our approach achieves deep, dynamic integration throughout the
entire processing pipeline. To this end, we propose Text-Guided Unified Vision
Encoding, incorporating textual information in vision encoding to achieve
pixel-level integration. We further design Context-Aware Recursive Alignment
Decoding that recursively aggregates visual features conditioned on textual
context during decoding, enabling fine-grained, question-level semantic
integration. To guide feature mapping and mitigate modality discrepancies, we
develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a
Synthesized Language-Driven Question-Answer (QA) dataset through a new data
synthesis method, prioritizing high-quality QA pairs to optimize text-guided
feature integration. Building on these foundations, we train FUSION at two
scales, 3B and 8B, and demonstrate that our full-modality integration approach
significantly outperforms existing methods with only 630 vision tokens.
Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most
benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited
to 300 vision tokens. Our ablation studies show that FUSION outperforms
LLaVA-NeXT on over half of the benchmarks under the same configuration without
dynamic resolution, highlighting the effectiveness of our approach. We release
our code, model weights, and dataset. https://github.com/starriver030515/FUSION
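
The abstract describes Text-Guided Unified Vision Encoding as incorporating textual information into the vision encoding itself. Below is a minimal, hypothetical PyTorch sketch of that idea, in which image patch tokens cross-attend to question embeddings inside an encoder block; the module name, dimensions, and layer layout are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: a vision encoder block where patch tokens
# cross-attend to text embeddings, conditioning visual features on
# the question early in the pipeline (an assumption, for illustration).
import torch
import torch.nn as nn

class TextGuidedVisionBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Standard self-attention over image patches.
        h = self.norm1(patches)
        patches = patches + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: patches (queries) attend to text tokens
        # (keys/values), injecting question semantics into the visual stream.
        h = self.norm2(patches)
        patches = patches + self.cross_attn(h, text, text, need_weights=False)[0]
        return patches + self.mlp(self.norm3(patches))

# Usage: 576 patch tokens conditioned on a 32-token question embedding.
block = TextGuidedVisionBlock()
out = block(torch.randn(1, 576, 768), torch.randn(1, 32, 768))  # (1, 576, 768)
```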
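Context-Aware Recursive Alignment Decoding is described as recursively aggregating visual features conditioned on textual context during decoding. The sketch below captures one plausible reading, assuming a small set of learned latent tokens refined over a few rounds; the class name, update rule, and round count are assumptions for illustration, not the paper's exact mechanism.

```python
# Hedged sketch: latent "alignment" tokens are refined recursively,
# each round first conditioning on the current text context, then
# re-aggregating the vision features under that context.
import torch
import torch.nn as nn

class RecursiveAligner(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, vision: torch.Tensor, text_ctx: torch.Tensor,
                rounds: int = 3) -> torch.Tensor:
        # Start from learned latent tokens, one copy per batch element.
        q = self.latents.unsqueeze(0).expand(vision.size(0), -1, -1)
        for _ in range(rounds):
            # Condition the latents on the current textual context...
            q = q + self.text_attn(self.norm_t(q), text_ctx, text_ctx,
                                   need_weights=False)[0]
            # ...then re-aggregate the visual features under that context.
            q = q + self.vision_attn(self.norm_v(q), vision, vision,
                                     need_weights=False)[0]
        return q  # question-conditioned visual summary fed back to the LLM

# Usage: summarize 630 vision tokens against a 32-token text context.
aligner = RecursiveAligner()
summary = aligner(torch.randn(2, 630, 768), torch.randn(2, 32, 768))  # (2, 16, 768)
```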
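Finally, the Dual-Supervised Semantic Mapping Loss is said to guide feature mapping and mitigate modality discrepancies. A hedged sketch of one such bidirectional objective follows, assuming simple linear projections supervised in both directions with an MSE target; the actual formulation in the paper may differ.

```python
# Hypothetical sketch of a dual-supervised mapping loss: project vision
# features into the text space and text features into the vision space,
# supervising each mapping against the opposite modality.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualMappingLoss(nn.Module):
    def __init__(self, v_dim: int = 768, t_dim: int = 4096):
        super().__init__()
        self.v2t = nn.Linear(v_dim, t_dim)  # vision -> text space
        self.t2v = nn.Linear(t_dim, v_dim)  # text -> vision space

    def forward(self, v_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # Pool each modality to one vector for a simple global alignment term.
        v, t = v_feat.mean(dim=1), t_feat.mean(dim=1)
        loss_v2t = F.mse_loss(self.v2t(v), t.detach())
        loss_t2v = F.mse_loss(self.t2v(t), v.detach())
        return loss_v2t + loss_t2v

# Usage: vision tokens (dim 768) against LLM hidden states (dim 4096).
loss = DualMappingLoss()(torch.randn(2, 630, 768), torch.randn(2, 32, 4096))
```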