Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
July 31, 2024
Authors: Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
cs.AI
Abstract
What latent features are encoded in language model (LM) representations?
Recent work on training sparse autoencoders (SAEs) to disentangle interpretable
features in LM representations has shown significant promise. However,
evaluating the quality of these SAEs is difficult because we lack a
ground-truth collection of interpretable features that we expect good SAEs to
recover. We thus propose to measure progress in interpretable dictionary
learning by working in the setting of LMs trained on chess and Othello
transcripts. These settings carry natural collections of interpretable features
-- for example, "there is a knight on F3" -- which we leverage into
supervised metrics for SAE quality. To guide progress in
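
As a rough illustration of what such a supervised metric can look like, the sketch below scores an SAE by asking, for each known binary board-state feature (e.g. "there is a knight on F3"), how well the best single SAE latent tracks it when thresholded, and then averages the resulting F1 scores. The function name, the thresholding, and the F1-based scoring are illustrative assumptions, not the paper's exact metric.

# Illustrative sketch (not the paper's exact metric): score an SAE against
# known board-state features. For each binary ground-truth feature we find the
# SAE latent whose thresholded activations best predict it, and report the
# mean best F1 across features.
import torch

def board_feature_coverage(latent_acts: torch.Tensor,
                           board_features: torch.Tensor,
                           threshold: float = 0.0) -> float:
    """latent_acts: (n_positions, n_latents) SAE activations.
    board_features: (n_positions, n_features) binary ground-truth features.
    Returns the mean, over features, of the best F1 achieved by any latent."""
    fired = (latent_acts > threshold).float()          # (N, L)
    truth = board_features.float()                     # (N, F)

    tp = truth.T @ fired                               # (F, L) true positives
    fp = (1 - truth).T @ fired                         # (F, L) false positives
    fn = truth.T @ (1 - fired)                         # (F, L) false negatives

    precision = tp / (tp + fp).clamp(min=1e-8)
    recall = tp / (tp + fn).clamp(min=1e-8)
    f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)

    best_f1_per_feature = f1.max(dim=1).values         # (F,)
    return best_f1_per_feature.mean().item()

In the chess setting, board_features would be something like one binary column per (piece, square) pair extracted from the game state at each move of a transcript.
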
interpretable dictionary learning, we introduce a new SAE training technique,
p-annealing, which improves performance on prior unsupervised
metrics as well as our new metrics.
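
The abstract only names p-annealing; as a point of reference, here is a minimal PyTorch sketch of the general idea, assuming a standard ReLU sparse autoencoder trained with an L_p^p sparsity penalty whose exponent p is annealed linearly from 1 toward a smaller value over training. The schedule, coefficients, and architecture details below are assumptions for illustration, not the authors' exact recipe.

# Minimal sketch of the p-annealing idea: train a ReLU sparse autoencoder and
# anneal the exponent p of the L_p^p sparsity penalty from 1 toward a smaller
# value as training progresses. Hyperparameters and schedule are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))     # sparse latent activations
        return self.decoder(f), f

def train_step(sae, x, optimizer, step, total_steps,
               sparsity_coeff=1e-3, p_start=1.0, p_end=0.5):
    # Linear annealing of p over training (assumed schedule).
    p = p_start + (p_end - p_start) * (step / total_steps)

    x_hat, f = sae(x)
    recon_loss = (x - x_hat).pow(2).mean()
    # L_p^p penalty on latent activations; the small epsilon keeps the
    # gradient finite where activations are exactly zero.
    sparsity_loss = (f.abs() + 1e-8).pow(p).sum(dim=-1).mean()

    loss = recon_loss + sparsity_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), p

With p = 1 this reduces to the familiar L1-penalized SAE objective; lowering p pushes the penalty closer to an L0-style sparsity pressure, which is the intuition behind annealing it downward during training.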