LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
January 31, 2026
Authors: Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach
cs.AI
Abstract
Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of the LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing a contextualized representation for each token in that corpus. Visual token representations are then compared against these contextualized textual representations, with the top-k nearest neighbors providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens, by contrast, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans than individual tokens do. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
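The core retrieval step described in the abstract can be sketched as a cosine-similarity top-k lookup over a bank of contextualized text-token representations. The function name, the NumPy setup, and the toy data below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def latent_lens_topk(visual_rep, text_bank, text_tokens, k=5):
    """Return the k corpus tokens whose contextualized representations
    are nearest (by cosine similarity) to a visual token representation.

    visual_rep:  (d,) array, one visual token's hidden state at some layer
    text_bank:   (n, d) array, contextualized representations of n corpus tokens
    text_tokens: list of n token strings aligned with text_bank rows
    """
    # Normalize to unit length so dot products equal cosine similarity.
    v = visual_rep / np.linalg.norm(visual_rep)
    bank = text_bank / np.linalg.norm(text_bank, axis=1, keepdims=True)
    sims = bank @ v
    # Indices of the k highest similarities, in descending order.
    top = np.argsort(-sims)[:k]
    return [(text_tokens[i], float(sims[i])) for i in top]
```

In practice the bank would hold per-layer hidden states from an LLM run over a large corpus, and the lookup would be repeated for every visual token at every layer; the sketch only shows the nearest-neighbor comparison itself.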