Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Abstract

Authorship attribution models fine-tuned from the same pretrained encoder, on the same data and with the same loss, can differ fourfold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally recoverable at every layer of every model, including an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal interventions show that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation in the early to middle layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal the distinct learning trajectories that follow from it.
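
To make the two scoring mechanisms concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not define the scorers, so the sketch assumes cosine similarity between mean token embeddings for mean pooling and a ColBERT-style MaxSim for late interaction. The function names, array shapes, and random inputs are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length (eps guards against zero norm)."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def mean_pool_score(a, b):
    """Assumed mean-pooling scorer.

    a, b: (num_tokens, dim) token embeddings from the encoder's final layer.
    Each document is collapsed to one vector before comparison.
    """
    va = l2_normalize(a.mean(axis=0))
    vb = l2_normalize(b.mean(axis=0))
    return float(va @ vb)


def late_interaction_score(a, b):
    """Assumed late-interaction scorer (ColBERT-style MaxSim).

    Every token of `a` is matched to its most similar token of `b`,
    and the per-token maxima are averaged.
    """
    sim = l2_normalize(a) @ l2_normalize(b).T  # (tokens_a, tokens_b) cosine matrix
    return float(sim.max(axis=1).mean())


# Hypothetical inputs: random stand-ins for encoder outputs of two documents.
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(5, 8))
doc_b = rng.normal(size=(7, 8))

print("mean pooling     :", mean_pool_score(doc_a, doc_b))
print("late interaction :", late_interaction_score(doc_a, doc_b))
```

Under these assumed definitions, the mean-pooling score depends on every token through a single averaged vector, whereas the MaxSim score depends only on the best-matching token pairs. This is one way to see why the two scorers could send differently structured gradients back into the encoder.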