Mixture-of-Depths Attention
March 16, 2026
Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
cs.AI
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it reduces average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with only a negligible 3.7% FLOPs overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
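The core idea, as the abstract describes it, can be sketched as a single head attending over its current layer's keys and values concatenated with depth KV pairs carried over from earlier layers. The sketch below is a minimal NumPy illustration under that assumption only; the paper's actual routing of depth KV pairs, causal masking, multi-head layout, and the hardware-efficient kernel are not shown, and all function and argument names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, k_seq, v_seq, depth_kv):
    """Single-head attention over current-layer sequence KV pairs
    plus depth KV pairs from preceding layers (illustrative only).

    q, k_seq, v_seq: (T, d) arrays at the current layer.
    depth_kv: list of (k_i, v_i) pairs from earlier layers, each (T, d).
    """
    d = q.shape[-1]
    # Concatenate current-layer KV with depth KV along the key axis,
    # so each query scores both sequence and depth positions.
    ks = np.concatenate([k_seq] + [k for k, _ in depth_kv], axis=0)
    vs = np.concatenate([v_seq] + [v for _, v in depth_kv], axis=0)
    scores = q @ ks.T / np.sqrt(d)        # (T, T * (1 + len(depth_kv)))
    return softmax(scores, axis=-1) @ vs  # (T, d)
```

In this toy form, the extra cost over plain attention is the wider score matrix from the appended depth keys, which is consistent with the abstract's claim of a small FLOPs overhead when only a few depth KV pairs are attended.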