Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Abstract

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including an off-the-shelf pretrained control encoder, so the gap does not come from representation quality. Instead, causal interventions show that the scorer determines where in the encoder authorship signal is consolidated. Mean pooling forces consolidation by the early-to-mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer and show that training dynamics follow distinct learning trajectories consistent with that difference.
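
The abstract names the two scoring mechanisms without defining them. As a minimal sketch, assuming that mean pooling means cosine similarity between averaged token embeddings and that late interaction means a ColBERT-style MaxSim over token pairs, the contrast can be written as follows. The function names, array shapes, and the averaging over query tokens are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mean_pool_score(query_tokens, doc_tokens):
    """Assumed mean-pooling scorer: cosine similarity of averaged token embeddings.

    Both inputs have shape (num_tokens, hidden_dim). Every token contributes
    to the score with the same weight 1 / num_tokens.
    """
    q = l2_normalize(query_tokens.mean(axis=0))
    d = l2_normalize(doc_tokens.mean(axis=0))
    return float(q @ d)


def late_interaction_score(query_tokens, doc_tokens):
    """Assumed late-interaction scorer (MaxSim-style).

    Each query token is matched to its most similar document token, and the
    per-token maxima are averaged. Only the matched pairs affect the score.
    """
    sim = l2_normalize(query_tokens) @ l2_normalize(doc_tokens).T  # (n_q, n_d)
    return float(sim.max(axis=1).mean())


def late_interaction_matches(query_tokens, doc_tokens):
    """Index of the document token selected by each query token under MaxSim."""
    sim = l2_normalize(query_tokens) @ l2_normalize(doc_tokens).T
    return sim.argmax(axis=1)


# Hypothetical token embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
text_a = rng.normal(size=(5, 8))  # 5 tokens, hidden size 8
text_b = rng.normal(size=(7, 8))  # 7 tokens, hidden size 8

print(mean_pool_score(text_a, text_b))
print(late_interaction_score(text_a, text_b))
print(late_interaction_matches(text_a, text_b))
```

Under these assumed definitions, the mean-pooled score depends on every token with equal weight, whereas the MaxSim score depends only on the selected token pairs. This is one plausible reading of the "gradient structure" the abstract refers to; the derivation itself is given in the paper, not here.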