VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set
October 24, 2025
Authors: Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang
cs.AI
Abstract
The alignment of vision-language representations endows current
Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities.
However, the interpretability of the alignment component remains under-explored
due to the difficulty of mapping the semantics of multi-modal representations
into a unified concept set. To address this problem, we propose VL-SAE, a
sparse autoencoder that encodes vision-language representations into its hidden
activations. Each neuron in its hidden layer correlates with a concept
represented by semantically similar images and texts, thereby interpreting
these representations with a unified concept set. To establish the
neuron-concept correlation, we encourage semantically similar representations
to exhibit consistent neuron activations during self-supervised training.
First, to measure the semantic similarity of multi-modal representations, we
perform their alignment in an explicit form based on cosine similarity. Second,
we construct the VL-SAE with a distance-based encoder and two modality-specific
decoders to ensure the activation consistency of semantically similar
representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA)
demonstrate the superior capability of VL-SAE in interpreting and enhancing the
vision-language alignment. For interpretation, the alignment between vision and
language representations can be understood by comparing their semantics with
concepts. For enhancement, the alignment can be strengthened by aligning
vision-language representations at the concept level, contributing to
performance improvements in downstream tasks, including zero-shot image
classification and hallucination elimination. Code is available at
https://github.com/ssfgunner/VL-SAE.
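
The abstract describes the architecture only at a high level. The following minimal PyTorch sketch illustrates one way a sparse autoencoder with a shared similarity-based encoder and two modality-specific decoders could be organized; it is an assumption-laden illustration, not the released implementation. The class name VLSAE, the cosine-similarity encoder, the ReLU sparsity, and the loss weights are placeholders chosen here for clarity; see the linked repository for the authors' code.

```python
# Minimal sketch of a VL-SAE-style sparse autoencoder (illustrative only; the
# exact encoder form, sparsity penalty, and loss weights are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLSAE(nn.Module):
    def __init__(self, dim: int, n_concepts: int):
        super().__init__()
        # Shared "concept directions": the encoder scores each representation
        # against these directions, so semantically similar image/text
        # representations receive similar hidden activations.
        self.concepts = nn.Parameter(torch.randn(n_concepts, dim) * 0.02)
        # Two modality-specific decoders map hidden activations back to the
        # original vision / language representation spaces.
        self.dec_v = nn.Linear(n_concepts, dim)
        self.dec_t = nn.Linear(n_concepts, dim)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Similarity-based encoding: cosine similarity to each concept
        # direction, followed by ReLU to obtain non-negative sparse codes.
        sims = F.normalize(x, dim=-1) @ F.normalize(self.concepts, dim=-1).T
        return F.relu(sims)

    def forward(self, img_repr: torch.Tensor, txt_repr: torch.Tensor):
        z_v, z_t = self.encode(img_repr), self.encode(txt_repr)
        recon_v, recon_t = self.dec_v(z_v), self.dec_t(z_t)
        # Reconstruction keeps the codes faithful to each modality.
        loss_recon = F.mse_loss(recon_v, img_repr) + F.mse_loss(recon_t, txt_repr)
        # Consistency term: paired image/text representations should activate
        # the same hidden neurons (concepts).
        loss_consist = F.mse_loss(z_v, z_t)
        # L1 penalty encourages sparse, interpretable activations.
        loss_sparse = z_v.abs().mean() + z_t.abs().mean()
        return loss_recon + loss_consist + 0.1 * loss_sparse, (z_v, z_t)
```

In use, img_repr and txt_repr would be paired vision and language representations taken from a frozen VLM such as CLIP or LLaVA; the sparse activations z_v and z_t then act as concept-level codes that can be compared neuron by neuron to interpret, or further align, the two modalities.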