Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
April 12, 2024
Authors: Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
cs.AI
Abstract
The quadratic complexity and weak length extrapolation of Transformers limit
their ability to scale to long sequences, and while sub-quadratic solutions
like linear attention and state space models exist, they empirically
underperform Transformers in pretraining efficiency and downstream task
accuracy. We introduce Megalodon, a neural architecture for efficient sequence
modeling with unlimited context length. Megalodon inherits the architecture of
Mega (exponential moving average with gated attention), and further introduces
multiple technical components to improve its capability and stability,
including complex exponential moving average (CEMA), timestep normalization
layer, normalized attention mechanism and pre-norm with two-hop residual
configuration. In a controlled head-to-head comparison with Llama2, Megalodon
achieves better efficiency than Transformers at the scale of 7 billion
parameters and 2 trillion training tokens. Megalodon reaches a training loss of
1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67). Code:
https://github.com/XuezheMax/megalodon
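For intuition about the complex exponential moving average (CEMA) component mentioned in the abstract, below is a minimal, self-contained sketch of a complex-valued EMA recurrence. The function name `complex_ema` and its per-channel parameters `alpha` (decay) and `theta` (phase) are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def complex_ema(u, alpha, theta):
    """Hypothetical sketch of a complex-valued exponential moving average
    (in the spirit of Megalodon's CEMA). The parameterization here is
    illustrative, not the paper's exact formulation.

    u:     (seq_len, dim) real-valued input sequence
    alpha: (dim,) decay weights in (0, 1)
    theta: (dim,) rotation angles giving the decay a complex phase
    """
    # Complex decay factor: magnitude (1 - alpha), rotated by angle theta.
    decay = (1.0 - alpha) * torch.exp(1j * theta)        # shape: (dim,)
    h = torch.zeros(u.shape[-1], dtype=torch.cfloat)     # hidden state
    outputs = []
    for u_t in u:  # sequential recurrence for clarity
        h = alpha * u_t.to(torch.cfloat) + decay * h
        outputs.append(h.real)                           # project back to the real domain
    return torch.stack(outputs)

# Example: smooth a random sequence of 16 steps with 4 channels.
u = torch.randn(16, 4)
y = complex_ema(u, alpha=torch.full((4,), 0.1), theta=torch.full((4,), 0.3))
print(y.shape)  # torch.Size([16, 4])
```

The recurrence is written step by step for readability; a linear recurrence of this form can equivalently be evaluated as a convolution over the input, which is how such layers are typically parallelized during training.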