
Scaling Laws for Linear Complexity Language Models

June 24, 2024
Authors: Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong
cs.AI

Abstract

The interest in linear complexity models for large language models is on the rise, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures: TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a softmax-attention baseline for comparison. Each architecture was trained in six variants ranging from 70M to 7B parameters on a 300B-token corpus, and a total of 1,376 intermediate checkpoints were evaluated on validation loss and on downstream tasks covering commonsense reasoning, information retrieval, and generation. The study reveals that existing linear complexity language models exhibit scaling capabilities similar to those of conventional transformer-based models, while also demonstrating superior linguistic proficiency and knowledge retention.
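
As a rough illustration of the decay mechanisms that distinguish the three architectures, the following is a minimal NumPy sketch of the recurrent view of causal linear attention. The function name, the scalar and sigmoid gate parameterizations, and the omission of normalization and cosine-based reweighting are simplifying assumptions for exposition, not the exact formulations of TNL, HGRN2, or cosFormer2.

import numpy as np

def linear_attention(q, k, v, decay_fn):
    """Causal linear attention in recurrent form.
    q, k, v: arrays of shape (seq_len, d).
    decay_fn(x_t) -> scalar or per-dimension decay in [0, 1] applied to the state.
    """
    seq_len, d = q.shape
    state = np.zeros((d, d))            # running sum of decayed outer products k_t v_t^T
    outputs = np.zeros((seq_len, d))
    for t in range(seq_len):
        lam = decay_fn(k[t])            # how much of the previous state is retained
        state = lam * state + np.outer(k[t], v[t])
        outputs[t] = q[t] @ state       # O(d^2) per token, independent of seq_len
    return outputs

rng = np.random.default_rng(0)
seq_len, d = 16, 8
q, k, v = (rng.standard_normal((seq_len, d)) for _ in range(3))

# Data-independent decay (TNL-style): a fixed constant, the same for every token.
out_fixed = linear_attention(q, k, v, decay_fn=lambda x: 0.95)

# Data-dependent decay (HGRN2-style): a gate computed from the current input.
out_gated = linear_attention(q, k, v, decay_fn=lambda x: 1.0 / (1.0 + np.exp(-x)))

# No decay (cosFormer2-style): the state accumulates without forgetting.
out_nodecay = linear_attention(q, k, v, decay_fn=lambda x: 1.0)

Under this recurrent view the state has a fixed d x d size, so per-token compute and memory do not grow with sequence length, which is the source of the linear complexity whose scaling behavior the paper studies.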
