Delta Attention Residuals

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and close to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: which layer-wise representations should be routed in Attention Residuals? To answer it, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M to 7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, improving validation perplexity by 1.7 to 8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
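To make the distinction between routing over cumulative states and routing over deltas concrete, the following is a minimal sketch in plain Python. It is not the released implementation. The abstract does not specify the scoring function, so the sketch assumes a single query vector per layer, dot-product scores, and L2-normalized keys; the names `route`, `deltas_from_states`, and `l2_normalize`, the toy hidden states, and the query are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Assumed key normalization; the paper's actual choice may differ."""
    norm = math.sqrt(dot(vec, vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def deltas_from_states(states):
    """Compute v_i = h_{i+1} - h_i for consecutive hidden states h_0..h_n."""
    return [[nxt - cur for cur, nxt in zip(h_cur, h_next)]
            for h_cur, h_next in zip(states, states[1:])]


def route(candidates, query):
    """Softmax attention over candidate vectors using one query vector.

    Returns the attention weights and the weighted mix of the candidates.
    """
    scores = [dot(query, l2_normalize(c)) for c in candidates]
    weights = softmax(scores)
    dim = len(candidates[0])
    mixed = [sum(w * c[d] for w, c in zip(weights, candidates))
             for d in range(dim)]
    return weights, mixed


# Toy hidden states h_0, h_1, h_2 sharing a large common component,
# which is what makes cumulative states redundant in this example.
states = [[10.0, 0.0, 0.0], [10.0, 1.0, 0.0], [10.0, 1.0, 1.0]]
query = [0.0, 0.0, 1.0]  # hypothetical query vector for the current layer

deltas = deltas_from_states(states)                 # v_0, v_1
state_weights, _ = route(states[1:], query)         # attend over h_1, h_2
delta_weights, delta_mix = route(deltas, query)     # attend over v_0, v_1

print("max weight over cumulative states:", max(state_weights))
print("max weight over deltas:", max(delta_weights))
```

In this hand-built example the cumulative states are nearly parallel after normalization, so their scores are almost equal and the softmax is close to uniform, while the deltas point in different directions and receive a more peaked distribution. The toy data is constructed to show that effect and is not evidence for it; the weights it produces are unrelated to the 0.2 and 0.6 figures reported above.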