Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor
July 9, 2025
Authors: Vatsal Agarwal, Matthew Gwilliam, Gefen Kohavi, Eshan Verma, Daniel Ulbricht, Abhinav Shrivastava
cs.AI
Abstract
Recent advances in multimodal large language models (MLLMs) have enabled
image-based question-answering capabilities. However, a key limitation is the
use of CLIP as the visual encoder; while it captures coarse global
information, it often misses fine-grained details that are relevant to the
input query. To address these shortcomings, this work studies whether
pre-trained text-to-image diffusion models can serve as instruction-aware
visual encoders. Through an analysis of their internal representations, we find
that diffusion features are both semantically rich and encode strong image-text
alignment. Moreover, we find that we can leverage text conditioning to focus
the model on regions relevant to the input question. We then investigate how to
align these features with large language models and uncover a leakage
phenomenon, where the LLM can inadvertently recover information from the
original diffusion prompt. We analyze the causes of this leakage and propose a
mitigation strategy. Based on these insights, we explore a simple fusion
strategy that utilizes both CLIP and conditional diffusion features. We
evaluate our approach on both general VQA and specialized MLLM benchmarks,
demonstrating the promise of diffusion models for visual understanding,
particularly in vision-centric tasks that require spatial and compositional
reasoning. Our project page can be found at
https://vatsalag99.github.io/mustafar/.
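
The sketch below is a rough illustration of the pipeline the abstract describes, not the authors' released implementation: run the Stable Diffusion U-Net once on a lightly noised image latent conditioned on the question, take intermediate features, and fuse them with CLIP patch features as visual tokens for an LLM. The checkpoint names, the mid-block layer choice, the noising timestep, the 4096-dim projection size, and the use of a recent diffusers API are all illustrative assumptions.

```python
# Hedged sketch of question-conditioned diffusion features fused with CLIP features.
# Layer choice, timestep, and projection sizes are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import CLIPVisionModel, CLIPImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

def diffusion_features(image: Image.Image, question: str, t: int = 100) -> torch.Tensor:
    """One question-conditioned U-Net pass; returns mid-block features of shape (1, tokens, 1280)."""
    feats = {}
    hook = pipe.unet.mid_block.register_forward_hook(lambda m, i, o: feats.update(mid=o))
    with torch.no_grad():
        px = pipe.image_processor.preprocess(image, height=512, width=512).to(device)
        latents = pipe.vae.encode(px).latent_dist.mean * pipe.vae.config.scaling_factor
        timestep = torch.tensor([t], device=device)
        noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), timestep)
        prompt_emb, _ = pipe.encode_prompt(question, device, 1, False)  # condition on the question
        pipe.unet(noisy, timestep, encoder_hidden_states=prompt_emb)
    hook.remove()
    return feats["mid"].flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, H*W, C)

def clip_features(image: Image.Image) -> torch.Tensor:
    """CLIP ViT-L/14 patch tokens without the [CLS] token: shape (1, 256, 1024)."""
    with torch.no_grad():
        px = clip_proc(images=image, return_tensors="pt").to(device)
        return clip(**px).last_hidden_state[:, 1:]

# Hypothetical linear projections into a 4096-dim LLM embedding space (e.g. a 7B model).
proj_clip = torch.nn.Linear(1024, 4096).to(device)
proj_diff = torch.nn.Linear(1280, 4096).to(device)

image = Image.open("example.jpg").convert("RGB")
question = "What color is the cup on the left?"
fused = torch.cat(
    [proj_clip(clip_features(image)), proj_diff(diffusion_features(image, question))],
    dim=1,
)  # visual token sequence that would be prepended to the LLM's text tokens
```

Because the U-Net pass is conditioned on the question text, the extracted features can emphasize image regions relevant to that question, which is the "task-aware" behavior the abstract refers to; the CLIP stream supplies the coarse global semantics.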