Mixture-of-Depths Attention

Abstract

Scaling depth is a key driver for large language models (LLMs). Yet as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend jointly to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, at a computational overhead of only 3.7% in FLOPs. We also find that combining MoDA with post-norm yields better performance than combining it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
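The abstract does not spell out how depth KV pairs are formed or combined with sequence KV pairs, so the following is only a minimal single-head sketch under stated assumptions. It assumes that each token's depth KV pairs are the keys and values computed for that same token position in preceding layers, and that one softmax is taken jointly over the causal sequence scores and the depth scores. The function name `moda_head_sketch`, the tensor shapes, and the joint-softmax choice are illustrative assumptions, not the released implementation, and the sketch does not reflect the hardware-efficient algorithm.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moda_head_sketch(q, k_seq, v_seq, k_depth, v_depth):
    """Illustrative single-head attention over sequence KV plus depth KV.

    q:       (T, d)    queries at the current layer
    k_seq:   (T, d)    keys at the current layer
    v_seq:   (T, d)    values at the current layer
    k_depth: (L, T, d) keys from L preceding layers (ASSUMPTION: same token position)
    v_depth: (L, T, d) values from L preceding layers (same assumption)
    """
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)

    # Causal scores over the sequence axis: (T, T).
    s_seq = (q @ k_seq.T) * scale
    causal = np.tril(np.ones((T, T), dtype=bool))
    s_seq = np.where(causal, s_seq, -np.inf)

    # Depth scores: token t scores its own position in each preceding layer: (T, L).
    s_depth = np.einsum("td,ltd->tl", q, k_depth) * scale

    # ASSUMPTION: one softmax over the concatenated sequence and depth scores.
    w = softmax(np.concatenate([s_seq, s_depth], axis=1), axis=1)
    w_seq, w_depth = w[:, :T], w[:, T:]

    # Weighted sum of sequence values plus weighted sum of depth values.
    out = w_seq @ v_seq + np.einsum("tl,ltd->td", w_depth, v_depth)
    return out, w
```

In this sketch, a token can place attention mass directly on features from shallower layers instead of relying only on what survives in the residual stream, which is the intuition the abstract gives for countering signal degradation.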
