Attention Residuals

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
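As a rough illustration of the contrast drawn above, the following minimal sketch compares fixed unit-weight accumulation with softmax attention over preceding layer outputs, and then applies the same attention to block-level representations. The abstract does not specify the scoring function or how a block is summarized, so the dot-product score, the `query` vector (standing in for a learned parameter), and the per-block summation are all assumptions made for illustration, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] for equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def unit_weight_residual(outputs):
    """Standard residual accumulation: every earlier output gets weight 1."""
    return weighted_sum([1.0] * len(outputs), outputs)


def attn_residual(outputs, query):
    """Softmax attention over depth.

    ASSUMPTION: scores are dot products with a hypothetical learned
    `query` vector; the source does not state the scoring function.
    """
    weights = softmax([dot(query, v) for v in outputs])
    return weighted_sum(weights, outputs), weights


def block_summaries(outputs, block_size):
    """ASSUMPTION: a block is represented by the sum of its layer outputs."""
    blocks = []
    for start in range(0, len(outputs), block_size):
        chunk = outputs[start:start + block_size]
        blocks.append(weighted_sum([1.0] * len(chunk), chunk))
    return blocks


# Four toy "layer outputs" of dimension 2 and a toy query.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
query = [0.5, -0.5]

uniform = unit_weight_residual(outputs)          # grows with depth
mixed, weights = attn_residual(outputs, query)   # convex combination
blocks = block_summaries(outputs, block_size=2)  # 2 entries instead of 4
block_mixed, block_weights = attn_residual(blocks, query)
```

Because the softmax weights are non-negative and sum to one, the attended result is a convex combination of earlier outputs, which is one way to see why its magnitude need not grow with depth the way the unit-weight sum does. In this sketch, attending over blocks shrinks the set of stored representations from one per layer to one per block.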