Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
February 5, 2025
Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
cs.AI
Abstract
We introduce a new approach to systematically map features discovered by
sparse autoencoders across consecutive layers of large language models,
extending earlier work that examined inter-layer feature links. By using a
data-free cosine similarity technique, we trace how specific features persist,
transform, or first appear at each stage. This method yields granular flow
graphs of feature evolution, enabling fine-grained interpretability and
mechanistic insights into model computations. Crucially, we demonstrate how
these cross-layer feature maps facilitate direct steering of model behavior by
amplifying or suppressing chosen features, achieving targeted thematic control
in text generation. Together, our findings highlight the utility of a causal,
cross-layer interpretability framework that not only clarifies how features
develop through forward passes but also provides new means for transparent
manipulation of large language models.
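As a minimal illustration of the data-free matching idea described in the abstract (assumed names, not the authors' released code): because SAE decoder rows are directions in the residual stream, features from SAEs at consecutive layers can be linked by cosine similarity alone, without running the model on any inputs. The function name `match_features` and the 0.5 threshold are illustrative assumptions.

```python
# Sketch of data-free cross-layer feature matching via cosine similarity.
import torch
import torch.nn.functional as F

def match_features(dec_a: torch.Tensor, dec_b: torch.Tensor, threshold: float = 0.5):
    """dec_a: (n_feat_a, d_model) decoder of the layer-l SAE;
    dec_b: (n_feat_b, d_model) decoder of the layer-(l+1) SAE.
    Returns each layer-l feature's best successor in layer l+1, the
    similarity score, and whether the link clears a (hypothetical)
    persistence threshold."""
    a = F.normalize(dec_a, dim=-1)          # unit-norm decoder directions
    b = F.normalize(dec_b, dim=-1)
    sims = a @ b.T                          # pairwise cosine similarities
    best_sim, best_idx = sims.max(dim=-1)   # best match per layer-l feature
    return best_idx, best_sim, best_sim >= threshold
```

Features whose best similarity falls below the threshold can be read as newly appearing (no clear predecessor) or vanishing (no clear successor), which is what the flow graphs visualize.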
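The steering step can likewise be sketched as adding a scaled feature direction to the residual stream during the forward pass. The hook mechanics are standard PyTorch; the `model.model.layers[layer_idx]` access path assumes a LLaMA-style Hugging Face model, and `alpha` is a hypothetical steering strength, so treat this as a sketch under those assumptions rather than the paper's implementation.

```python
# Sketch of feature steering: shift the residual stream along an SAE
# decoder direction at a chosen layer. alpha > 0 amplifies the feature's
# theme in generated text; alpha < 0 suppresses it.
import torch
import torch.nn.functional as F

def add_steering_hook(model, layer_idx: int, direction: torch.Tensor, alpha: float):
    direction = F.normalize(direction, dim=-1)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)  # shift every position
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # Returns a handle; call handle.remove() to stop steering.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```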