Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled image-based question answering. However, a key limitation is the use of CLIP as the visual encoder: while it captures coarse global information, it often misses fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find that diffusion features are rich in semantics and encode strong image-text alignment. Moreover, we find that text conditioning can be leveraged to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, in which the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that uses both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found at https://vatsalag99.github.io/mustafar/.
