Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Abstract

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding "trigger" phrases learned by Tulu-3 (Lambert et al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through their data.
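To make the idea of interpretable dimensions concrete, here is a minimal sketch of dataset differencing and concept filtering. It is not the authors' implementation: it assumes each document has already been encoded as the set of feature labels that fired on it, and the toy data, the labels, and the function names `feature_frequencies`, `top_differences`, and `filter_concepts` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each document is a sparse "SAE embedding", represented
# here as the set of interpretable feature labels that fired on it. The labels
# and values are made up for illustration and are not taken from the paper.
DATASET_A = [
    {"asks clarifying question", "formal tone"},
    {"asks clarifying question", "code snippet"},
    {"formal tone"},
    {"asks clarifying question"},
]
DATASET_B = [
    {"formal tone"},
    {"code snippet", "formal tone"},
    {"asks clarifying question"},
    {"code snippet"},
]


def feature_frequencies(docs):
    """Return the fraction of documents in which each feature is active."""
    counts = Counter(label for doc in docs for label in doc)
    return {label: n / len(docs) for label, n in counts.items()}


def top_differences(docs_a, docs_b, k=3):
    """Rank features by how much more often they fire in docs_a than in docs_b."""
    freq_a = feature_frequencies(docs_a)
    freq_b = feature_frequencies(docs_b)
    labels = set(freq_a) | set(freq_b)
    diffs = {
        label: freq_a.get(label, 0.0) - freq_b.get(label, 0.0) for label in labels
    }
    # Largest positive difference first; ties broken alphabetically by label.
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))[:k]


def filter_concepts(doc, keep):
    """Keep only the features on an axis of interest (the control step)."""
    return doc & keep


if __name__ == "__main__":
    for label, diff in top_differences(DATASET_A, DATASET_B):
        print(f"{label}: {diff:+.2f}")
    print(filter_concepts({"formal tone", "code snippet"}, {"formal tone"}))
```

Because every dimension carries a readable label, the ranked differences can be read directly as candidate hypotheses about how the two datasets differ, and restricting documents to a chosen subset of labels before clustering or retrieval is one simple way to picture the controllability the abstract describes.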
