Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Abstract

Despite their capabilities, Large Language Models (LLMs) remain opaque, and their internal representations are poorly understood. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), offer restricted insight due to limitations such as dependence on the model's output vocabulary or unclear feature names. This work introduces the Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). The probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate the decoding paradigm with controlled input-completion tasks, probing the model's final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that the probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, and that it helps identify LLM failures. Our work advances information decoding in LLM vector space, enabling the extraction of more informative, interpretable, and structured features from neural representations.

