Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures
September 29, 2025
Authors: Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
cs.AI
Abstract
Despite their capabilities, Large Language Models (LLMs) remain opaque, and
their internal representations are poorly understood. Current
interpretability methods, such as direct logit attribution (DLA) and sparse
autoencoders (SAEs), provide restricted insight due to limitations such as the
model's output vocabulary or unclear feature names. This work introduces
Hyperdimensional Probe, a novel paradigm for decoding information from the LLM
vector space. It combines ideas from symbolic representations and neural
probing to project the model's residual stream into interpretable concepts via
Vector Symbolic Architectures (VSAs). This probe combines the strengths of SAEs
and conventional probes while overcoming their key limitations. We validate our
decoding paradigm with controlled input-completion tasks, probing the model's
final state before next-token prediction on inputs spanning syntactic pattern
recognition, key-value associations, and abstract inference. We further assess
it in a question-answering setting, examining the state of the model both
before and after text generation. Our experiments show that our probe reliably
extracts meaningful concepts across varied LLMs, embedding sizes, and input
domains, and that it helps identify LLM failures. Our work advances information
decoding in LLM vector space, enabling the extraction of more informative,
interpretable, and structured features from neural representations.
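
To make the VSA idea concrete, the following is a minimal sketch of generic VSA operations, not the authors' probe. It assumes a simple bipolar scheme (element-wise multiplication for binding, addition for bundling, cosine similarity for cleanup); the dimensionality, the concept names, and all function names are hypothetical choices for illustration. It shows how a key-value record can be packed into one vector and a value read back out by name.

```python
import random

DIM = 2048  # hypervector dimensionality (arbitrary choice for this sketch)


def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors by element-wise multiplication.

    For bipolar vectors this operation is its own inverse.
    """
    return [x * y for x, y in zip(a, b)]


def bundle(*vectors):
    """Superpose several hypervectors by element-wise addition."""
    return [sum(xs) for xs in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def cleanup(query, codebook):
    """Return the name of the codebook entry most similar to the query."""
    return max(codebook, key=lambda name: cosine(query, codebook[name]))


# Hypothetical concept codebook: one random hypervector per symbol.
rng = random.Random(0)
names = ["country", "capital", "currency", "Italy", "Rome", "euro"]
codebook = {name: random_hv(rng) for name in names}

# Encode a key-value record as a single vector.
record = bundle(
    bind(codebook["country"], codebook["Italy"]),
    bind(codebook["capital"], codebook["Rome"]),
    bind(codebook["currency"], codebook["euro"]),
)

# Unbind with the key "capital": the result is a noisy copy of "Rome".
noisy_value = bind(record, codebook["capital"])
decoded = cleanup(noisy_value, codebook)
print(decoded)
```

Because random high-dimensional vectors are nearly orthogonal, the two unrelated terms left over after unbinding act as low-magnitude noise, so the nearest codebook entry is the stored value. A probe in the spirit of the abstract would first need a learned mapping from the residual stream into such a space, which this sketch does not attempt.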