Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
April 12, 2024
Authors: Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
cs.AI
Abstract
The quadratic complexity and weak length extrapolation of Transformers limit
their ability to scale to long sequences, and while sub-quadratic solutions
like linear attention and state space models exist, they empirically
underperform Transformers in pretraining efficiency and downstream task
accuracy. We introduce Megalodon, a neural architecture for efficient sequence
modeling with unlimited context length. Megalodon inherits the architecture of
Mega (exponential moving average with gated attention), and further introduces
multiple technical components to improve its capability and stability,
including complex exponential moving average (CEMA), timestep normalization
layer, normalized attention mechanism and pre-norm with two-hop residual
configuration. In a controlled head-to-head comparison with Llama2, Megalodon
achieves better efficiency than Transformers at the scale of 7 billion
parameters and 2 trillion training tokens. Megalodon reaches a training loss of
1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67). Code:
https://github.com/XuezheMax/megalodon
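For intuition about the complex exponential moving average (CEMA) component mentioned in the abstract, below is a minimal, self-contained sketch of a complex-valued EMA recurrence. The function name `complex_ema` and its per-channel parameters `alpha` (decay) and `theta` (phase) are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def complex_ema(u, alpha, theta):
    """Hypothetical sketch of a complex-valued exponential moving average
    (in the spirit of Megalodon's CEMA). The parameterization here is
    illustrative, not the paper's exact formulation.

    u:     (seq_len, dim) real-valued input sequence
    alpha: (dim,) decay weights in (0, 1)
    theta: (dim,) rotation angles giving the decay a complex phase
    """
    # Complex decay factor: magnitude (1 - alpha), rotated by angle theta.
    decay = (1.0 - alpha) * torch.exp(1j * theta)        # shape: (dim,)
    h = torch.zeros(u.shape[-1], dtype=torch.cfloat)     # hidden state
    outputs = []
    for u_t in u:  # sequential recurrence for clarity
        h = alpha * u_t.to(torch.cfloat) + decay * h
        outputs.append(h.real)                           # project back to the real domain
    return torch.stack(outputs)

# Example: smooth a random sequence of 16 steps with 4 channels.
u = torch.randn(16, 4)
y = complex_ema(u, alpha=torch.full((4,), 0.1), theta=torch.full((4,), 0.3))
print(y.shape)  # torch.Size([16, 4])
```

The recurrence is written step by step for readability; a linear recurrence of this form can equivalently be evaluated as a convolution over the input, which is how such layers are typically parallelized during training.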