Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
December 10, 2025
Authors: Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda
cs.AI
Abstract
Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding "trigger" phrases learned by Tulu-3 (Lambert et al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through their data.
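As a rough illustration of the idea described above (not the authors' implementation), the sketch below shows how dense document embeddings could be passed through an SAE encoder to obtain sparse, concept-aligned activations, how "concept filtering" keeps only the dimensions of interest, and how retrieval can then be run in the filtered space. All sizes, weights, and concept indices here are toy placeholders assumed for the example.

```python
# Minimal sketch of SAE embeddings with concept filtering (toy weights, not a released SAE).
import numpy as np

rng = np.random.default_rng(0)

d_model, n_concepts, n_docs = 64, 512, 100            # assumed toy sizes
W_enc = rng.normal(scale=0.1, size=(d_model, n_concepts))
b_enc = np.zeros(n_concepts)

def sae_embed(dense_embeddings: np.ndarray) -> np.ndarray:
    """Encode dense embeddings into sparse activations whose dimensions map to concepts (ReLU encoder)."""
    return np.maximum(dense_embeddings @ W_enc + b_enc, 0.0)

def concept_filter(sae_embeddings: np.ndarray, keep: list[int]) -> np.ndarray:
    """Zero out every dimension except the concepts of interest (e.g. a target property or topic)."""
    mask = np.zeros(sae_embeddings.shape[1])
    mask[keep] = 1.0
    return sae_embeddings * mask

def retrieve(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank documents by cosine similarity in the (filtered) SAE embedding space."""
    sims = (corpus @ query) / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(query) + 1e-8
    )
    return np.argsort(-sims)[:k]

dense_docs = rng.normal(size=(n_docs, d_model))        # stand-in for dense document embeddings
sae_docs = sae_embed(dense_docs)
filtered = concept_filter(sae_docs, keep=[3, 17, 42])  # hypothetical concept indices
print(retrieve(filtered[0], filtered))                 # property-based retrieval over the filtered space
```

In this toy setup the encoder weights are random; in practice the SAE would be trained so that each dimension corresponds to a human-interpretable concept, which is what makes the filtering step controllable.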