
LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

January 31, 2026
Authors: Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach
cs.AI

Abstract

Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized representations for each token in that corpus. Visual token representations are then compared to these stored contextualized text representations, with the top-k nearest neighbors providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens, in contrast, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
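
To make the retrieval step concrete, the sketch below builds a bank of contextualized text-token representations and returns the top-k nearest neighbors for a single visual token representation, as the abstract describes. It is a minimal illustration, not the paper's implementation: the cosine-similarity metric, the function names, and the random toy vectors standing in for real LLM activations are all assumptions.

```python
import numpy as np

def build_text_bank(contextual_reprs, tokens):
    """Stack contextualized token representations from a text corpus into a
    searchable bank. `contextual_reprs` is a list of (d,)-vectors and `tokens`
    the corresponding surface tokens (hypothetical inputs)."""
    bank = np.stack(contextual_reprs)                       # (N, d)
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)     # normalize rows for cosine similarity
    return bank, list(tokens)

def describe_visual_token(visual_repr, bank, tokens, k=5):
    """Describe one visual token representation (at some LLM layer) by its
    k nearest contextualized text tokens under cosine similarity."""
    q = visual_repr / np.linalg.norm(visual_repr)            # (d,)
    sims = bank @ q                                           # similarity to every corpus token
    top = np.argsort(-sims)[:k]                               # indices of the k most similar tokens
    return [(tokens[i], float(sims[i])) for i in top]

# Toy usage with random vectors in place of real activations.
rng = np.random.default_rng(0)
d = 16
corpus_vecs = [rng.normal(size=d) for _ in range(100)]
corpus_toks = [f"tok_{i}" for i in range(100)]
bank, toks = build_text_bank(corpus_vecs, corpus_toks)
print(describe_visual_token(rng.normal(size=d), bank, toks, k=3))
```

In practice the bank would hold per-token hidden states from encoding a large corpus with the same LLM, and the query would be a visual token's hidden state at a given layer; an approximate nearest-neighbor index would replace the brute-force dot product at corpus scale.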