Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Abstract

Authorship attribution models fine-tuned from the same pretrained encoder, on the same data and with the same loss, can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer of every model, including an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal interventions show that the scorer determines where the encoder consolidates authorship signal: mean pooling forces consolidation by the early-to-mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from it.
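To make the contrast between the two scorers concrete, the following is a minimal sketch, not the implementation used in this work. It assumes unnormalized dot-product similarity, a ColBERT-style MaxSim form for late interaction, and hypothetical function names; the actual scorers may differ (for example, in normalization or symmetry). It also computes the gradient of each score with respect to the document token embeddings, which illustrates one way the gradient structure can differ: mean pooling sends the same gradient to every token, while late interaction sends gradient only to the tokens selected by the max.

```python
from typing import List

Vec = List[float]
Tokens = List[Vec]  # one embedding vector per token


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vec(tokens: Tokens) -> Vec:
    """Average the token embeddings into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def mean_pool_score(q: Tokens, d: Tokens) -> float:
    """Assumed mean-pooling scorer: dot product of the two pooled vectors."""
    return dot(mean_vec(q), mean_vec(d))


def late_interaction_score(q: Tokens, d: Tokens) -> float:
    """Assumed late-interaction scorer (MaxSim): each query token keeps
    its best-matching document token, and the maxima are summed."""
    return sum(max(dot(qi, dj) for dj in d) for qi in q)


def mean_pool_grad(q: Tokens, d: Tokens) -> Tokens:
    """Gradient of mean_pool_score w.r.t. each document token.
    Every token receives the same vector: mean(q) / len(d)."""
    q_bar = mean_vec(q)
    return [[x / len(d) for x in q_bar] for _ in d]


def late_interaction_grad(q: Tokens, d: Tokens) -> Tokens:
    """Gradient of late_interaction_score w.r.t. each document token.
    Only tokens chosen by some query token's max receive gradient."""
    grad = [[0.0] * len(d[0]) for _ in d]
    for qi in q:
        best = max(range(len(d)), key=lambda j: dot(qi, d[j]))
        grad[best] = [g + x for g, x in zip(grad[best], qi)]
    return grad


if __name__ == "__main__":
    # Toy embeddings chosen by hand for illustration only.
    q = [[1.0, 0.0], [0.0, 1.0]]
    d = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
    print("mean pooling score:    ", mean_pool_score(q, d))
    print("late interaction score:", late_interaction_score(q, d))
    print("mean pooling grad:     ", mean_pool_grad(q, d))
    print("late interaction grad: ", late_interaction_grad(q, d))
```

In this toy example the third document token receives zero gradient under late interaction, because no query token selects it, whereas under mean pooling all three tokens receive an identical update. This only illustrates the scorers as assumed here and should not be read as the derivation referred to in the abstract.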