Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
December 10, 2025
Authors: Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda
cs.AI
Abstract
Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding "trigger" phrases learned by Tulu-3 (Lambert et al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through their data.
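To make the pipeline the abstract describes concrete, here is a minimal numpy sketch of the SAE-embedding idea, covering (1) dataset diffing and (3)/(4) concept filtering. This is a hypothetical toy, not the authors' implementation: `get_activations`, `encoder_W`, `encoder_b`, the corpora, and the retained feature subset are all placeholder assumptions standing in for a real language model and a pretrained sparse autoencoder.

```python
# Minimal sketch of SAE embeddings for data analysis -- NOT the paper's code.
# All weights, data, and the activation function are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, N_FEATURES = 768, 16384  # LM activation dim, SAE dictionary size
encoder_W = rng.standard_normal((D_MODEL, N_FEATURES)).astype(np.float32) * 0.02
encoder_b = np.zeros(N_FEATURES, dtype=np.float32)

def get_activations(doc: str) -> np.ndarray:
    """Placeholder: one activation vector per token from a language model."""
    n_tokens = max(1, len(doc.split()))
    return rng.standard_normal((n_tokens, D_MODEL)).astype(np.float32)

def sae_embed(doc: str) -> np.ndarray:
    """Sparse, interpretable document embedding: pass token activations
    through the SAE encoder (ReLU(x W + b)) and max-pool over tokens."""
    acts = get_activations(doc)
    feats = np.maximum(acts @ encoder_W + encoder_b, 0.0)  # sparse features
    return feats.max(axis=0)                               # pool over tokens

# (1) Dataset diffing: rank SAE features by mean-activation difference
# between two corpora, then inspect the top features' concept labels.
corpus_a = ["the model asks a clarifying question", "a careful hedged answer"]
corpus_b = ["an enthusiastic confident reply", "a long rambling response"]
emb_a = np.stack([sae_embed(d) for d in corpus_a])
emb_b = np.stack([sae_embed(d) for d in corpus_b])
diff = emb_a.mean(axis=0) - emb_b.mean(axis=0)
top_features = np.argsort(-np.abs(diff))[:5]  # candidate concepts to inspect

# (3)/(4) Controllability: zero out features unrelated to a target property
# before clustering or retrieval, so similarity is computed only along the
# axes of interest (`keep` here is an arbitrary illustrative subset).
keep = np.zeros(N_FEATURES, dtype=bool)
keep[top_features] = True
filtered = emb_a * keep  # embeddings restricted to the chosen concepts
```

In a realistic setting, the SAE encoder would be trained on activations from a specific layer of a specific model, and each of the dictionary's feature dimensions would carry a human-readable concept label; it is those labels that make the filtering step a deliberate choice of analysis axis rather than an arbitrary masking of coordinates.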