Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Abstract

Despite their capabilities, Large Language Models (LLMs) remain opaque, and their internal representations are poorly understood. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), offer only restricted insight: DLA is bound to the model's output vocabulary, and SAEs tend to yield features with unclear names. This work introduces the Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). The probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate this decoding paradigm with controlled input-completion tasks, probing the model's final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that the probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, and that it also helps identify LLM failures. Our work advances information decoding in the LLM vector space, enabling the extraction of more informative, interpretable, and structured features from neural representations.
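The abstract does not say which VSA variant or encoding the probe uses, so the following is only a minimal sketch of the general VSA operations it refers to, not the authors' implementation. It assumes a bipolar, MAP-style architecture in which binding is elementwise multiplication and bundling is addition. The dimensionality, the concept names, and every function name here are hypothetical choices made for illustration. The sketch stores two key-value associations in one hypervector and then recovers a value by unbinding with its key and matching the result against a codebook.

```python
import numpy as np

DIM = 4096  # hypervector dimensionality (assumed value, not from the paper)
rng = np.random.default_rng(0)


def random_hv():
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return rng.choice(np.array([-1, 1]), size=DIM)


# Hypothetical concept codebook: one random hypervector per concept name.
concepts = ["country", "capital", "France", "Paris", "Italy", "Rome"]
codebook = {name: random_hv() for name in concepts}


def bind(a, b):
    """Bind two hypervectors (elementwise product in a MAP-style VSA)."""
    return a * b


def unbind(record, key):
    """Bipolar binding is its own inverse, so unbinding is another product."""
    return record * key


def decode(query):
    """Return the codebook concept with the highest cosine similarity."""
    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(codebook, key=lambda name: cosine(query, codebook[name]))


# Bundle (sum) two key-value bindings into a single structured record.
record = (bind(codebook["country"], codebook["France"])
          + bind(codebook["capital"], codebook["Paris"]))

# Query the record: which concept is bound to the key "capital"?
answer = decode(unbind(record, codebook["capital"]))
print(answer)
```

Unbinding leaves the stored value plus a noise term that is nearly orthogonal to every codebook entry in high dimensions, which is why the nearest-neighbor lookup in `decode` can recover the value. A probe in the spirit of the abstract would obtain `record` by mapping the model's residual stream into this space rather than constructing it by hand.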
