Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
February 5, 2025
Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
cs.AI
Abstract
We introduce a new approach to systematically map features discovered by
sparse autoencoders across consecutive layers of large language models,
extending earlier work that examined inter-layer feature links. By using a
data-free cosine similarity technique, we trace how specific features persist,
transform, or first appear at each stage. This method yields granular flow
graphs of feature evolution, enabling fine-grained interpretability and
mechanistic insights into model computations. Crucially, we demonstrate how
these cross-layer feature maps facilitate direct steering of model behavior by
amplifying or suppressing chosen features, achieving targeted thematic control
in text generation. Together, our findings highlight the utility of a causal,
cross-layer interpretability framework that not only clarifies how features
develop through forward passes but also provides new means for transparent
manipulation of large language models.
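As a minimal illustration of the data-free matching idea described in the abstract (assumed names, not the authors' released code): because SAE decoder rows are directions in the residual stream, features from SAEs at consecutive layers can be linked by cosine similarity alone, without running the model on any inputs. The function name `match_features` and the 0.5 threshold are illustrative assumptions.

```python
# Sketch of data-free cross-layer feature matching via cosine similarity.
import torch
import torch.nn.functional as F

def match_features(dec_a: torch.Tensor, dec_b: torch.Tensor, threshold: float = 0.5):
    """dec_a: (n_feat_a, d_model) decoder of the layer-l SAE;
    dec_b: (n_feat_b, d_model) decoder of the layer-(l+1) SAE.
    Returns each layer-l feature's best successor in layer l+1, the
    similarity score, and whether the link clears a (hypothetical)
    persistence threshold."""
    a = F.normalize(dec_a, dim=-1)          # unit-norm decoder directions
    b = F.normalize(dec_b, dim=-1)
    sims = a @ b.T                          # pairwise cosine similarities
    best_sim, best_idx = sims.max(dim=-1)   # best match per layer-l feature
    return best_idx, best_sim, best_sim >= threshold
```

Features whose best similarity falls below the threshold can be read as newly appearing (no clear predecessor) or vanishing (no clear successor), which is what the flow graphs visualize.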
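The steering step can likewise be sketched as adding a scaled feature direction to the residual stream during the forward pass. The hook mechanics are standard PyTorch; the `model.model.layers[layer_idx]` access path assumes a LLaMA-style Hugging Face model, and `alpha` is a hypothetical steering strength, so treat this as a sketch under those assumptions rather than the paper's implementation.

```python
# Sketch of feature steering: shift the residual stream along an SAE
# decoder direction at a chosen layer. alpha > 0 amplifies the feature's
# theme in generated text; alpha < 0 suppresses it.
import torch
import torch.nn.functional as F

def add_steering_hook(model, layer_idx: int, direction: torch.Tensor, alpha: float):
    direction = F.normalize(direction, dim=-1)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)  # shift every position
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # Returns a handle; call handle.remove() to stop steering.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```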