Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

July 31, 2024
Authors: Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
cs.AI

Abstract

What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of these SAEs is difficult because we lack a ground-truth collection of interpretable features that we expect good SAEs to recover. We thus propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts. These settings carry natural collections of interpretable features -- for example, "there is a knight on F3" -- which we leverage into supervised metrics for SAE quality. To guide progress in interpretable dictionary learning, we introduce a new SAE training technique, p-annealing, which improves performance on prior unsupervised metrics as well as our new metrics.
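The abstract names two concrete ingredients, sparse autoencoders and the new p-annealing training technique, without giving details. As a rough illustration only, the sketch below shows a minimal SAE in PyTorch whose L_p sparsity penalty has its exponent p annealed from 1 (the convex L1 case) toward a smaller value over training; the module names, hyperparameters, and linear schedule are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, not the authors' implementation: a sparse autoencoder (SAE)
# trained with an L_p sparsity penalty whose exponent p is annealed from 1
# (convex L1) toward a smaller value, a closer surrogate for an L0 feature count.
# All names, shapes, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # LM activations -> dictionary features
        self.decoder = nn.Linear(d_dict, d_model)  # dictionary features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))            # nonnegative feature activations
        return self.decoder(f), f

def annealed_p(step: int, total_steps: int,
               p_start: float = 1.0, p_end: float = 0.2) -> float:
    """Linearly interpolate the sparsity exponent p over training (assumed schedule)."""
    t = min(step / total_steps, 1.0)
    return p_start + t * (p_end - p_start)

def sae_loss(x, x_hat, f, p, sparsity_coeff: float = 1e-3, eps: float = 1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()      # reconstruction error
    sparsity = ((f + eps) ** p).sum(dim=-1).mean()     # L_p penalty; sparser as p -> 0
    return recon + sparsity_coeff * sparsity

# Toy loop on random vectors standing in for LM activations.
d_model, d_dict, total_steps = 512, 4096, 1_000
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(total_steps):
    x = torch.randn(64, d_model)                       # placeholder activation batch
    x_hat, f = sae(x)
    loss = sae_loss(x, x_hat, f, p=annealed_p(step, total_steps))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the board game setting, a supervised metric of the kind the abstract describes might, for example, test whether some SAE feature reliably indicates a ground-truth board property such as "there is a knight on F3"; the abstract leaves the exact form of these metrics unspecified.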
