Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Abstract

Authorship attribution models fine-tuned from the same pretrained encoder, on the same data and with the same loss, can differ by up to four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer of every model, including an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal interventions show that the scorer determines where the encoder consolidates authorship signal: mean pooling forces consolidation in the early-to-mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from it.
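To make the two scoring mechanisms concrete, the following is a minimal sketch of how a mean-pooling scorer and a late-interaction scorer might compare two documents given their per-token encoder outputs. The abstract does not give exact formulas, so everything here is an assumption for illustration: the function names are hypothetical, cosine similarity is assumed for both scorers, and the late-interaction variant is a ColBERT-style MaxSim that averages (rather than sums) over tokens.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mean_pool_score(a, b):
    """Assumed mean-pooling scorer.

    a, b: arrays of shape (num_tokens, dim) holding token embeddings.
    Each document is collapsed to a single vector by averaging its tokens,
    and the score is the cosine similarity of the two pooled vectors.
    """
    pooled_a = l2_normalize(a.mean(axis=0))
    pooled_b = l2_normalize(b.mean(axis=0))
    return float(pooled_a @ pooled_b)


def late_interaction_score(a, b):
    """Assumed late-interaction (MaxSim-style) scorer.

    Tokens are kept separate. Each token in `a` is matched to its most
    similar token in `b`, and the per-token maxima are averaged.
    """
    sim = l2_normalize(a) @ l2_normalize(b).T  # (num_tokens_a, num_tokens_b)
    return float(sim.max(axis=1).mean())


if __name__ == "__main__":
    # Toy token embeddings, chosen only to show that the scorers can disagree.
    a = np.array([[1.0, 0.0], [0.0, 1.0]])
    b = np.array([[1.0, 0.0], [1.0, 0.0]])
    print(mean_pool_score(a, b))
    print(late_interaction_score(a, b))
```

On this toy input the mean-pooling score is about 0.707 and the late-interaction score is 0.5, so the two scorers rank the same pair differently. In this sketch, the mean sends gradient to every token embedding with equal weight, while the max sends it only through the selected token pairs. This is one plausible reading of the "gradient structure" contrast the abstract refers to, not a reproduction of the authors' derivation.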