Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

Abstract

Interpretability is a key challenge in fostering trust in Large Language Models (LLMs), a challenge that stems from the complexity of extracting reasoning from a model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) to interpret and control LLMs by modeling multi-token words. Prior research used LRH to connect LLM representations with linguistic concepts, but was limited to single-token analysis. As most words are composed of several tokens, we extend LRH to multi-token words, thereby enabling its use on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of word frames sharing a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify these ideas on the Llama 3.1, Gemma 2, and Phi 3 families, demonstrating gender and language biases and exposing harmful content, but also showing the potential to remediate them, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git
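
The three ideas named in the abstract (words as frames, concepts as averages of word frames, and top-k concept-guided decoding) can be illustrated with a toy sketch. This is not the authors' implementation: the token vectors, the `TOKEN_VECTORS` table, the function names, the restriction to equal-length frames, and the cosine-based selection rule are all assumptions made only for illustration.

```python
import math

# Hypothetical toy token vectors; not taken from any real model.
TOKEN_VECTORS = {
    "wo": [1.0, 0.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "girl": [0.8, 0.2, 0.0],
    "s": [0.0, 0.0, 1.0],
    "boy": [-0.8, 0.2, 0.0],
}


def word_frame(tokens):
    """Represent a word as an ordered sequence of its token vectors."""
    return [TOKEN_VECTORS[t] for t in tokens]


def concept_frame(frames):
    """Position-wise mean of word frames that share a concept.

    Assumption: all frames have the same number of tokens.
    """
    length = len(frames[0])
    if any(len(f) != length for f in frames):
        raise ValueError("this sketch only averages frames of equal length")
    dim = len(frames[0][0])
    return [
        [sum(f[i][d] for f in frames) / len(frames) for d in range(dim)]
        for i in range(length)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def concept_guided_pick(candidates, concept, k):
    """Keep the k highest-scoring (token, score) candidates, then return the
    token whose vector is most similar to the concept's first frame vector.

    Assumption: this selection rule is a simplified stand-in for guided decoding.
    """
    top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    target = concept[0]
    return max(top_k, key=lambda c: cosine(TOKEN_VECTORS[c[0]], target))[0]


# Usage: build a toy concept from two two-token words, then steer the choice.
female = concept_frame([word_frame(["wo", "man"]), word_frame(["girl", "s"])])
candidates = [("boy", 0.5), ("girl", 0.3), ("s", 0.15), ("wo", 0.05)]
guided = concept_guided_pick(candidates, female, k=3)
unguided = concept_guided_pick(candidates, female, k=1)
```

With `k=1` the selection reduces to greedy decoding over the model scores, while a larger `k` lets the concept override the top-scoring token among plausible candidates. Restricting the choice to the top-k candidates is what keeps the steered output close to what the model would otherwise generate.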
