Scaling Laws for Linear Complexity Language Models
June 24, 2024
Authors: Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong
cs.AI
Abstract
The interest in linear complexity models for large language models is on the
rise, although their scaling capacity remains uncertain. In this study, we
present the scaling laws for linear complexity language models to establish a
foundation for their scalability. Specifically, we examine the scaling
behaviors of three efficient linear architectures. These include TNL, a linear
attention model with data-independent decay; HGRN2, a linear RNN with
data-dependent decay; and cosFormer2, a linear attention model without decay.
We also include LLaMA as a baseline architecture for softmax attention for
comparison. These models were trained with six variants, ranging from 70M to 7B
parameters on a 300B-token corpus, and evaluated with a total of 1,376
intermediate checkpoints on various downstream tasks. These tasks include
validation loss, commonsense reasoning, and information retrieval and
generation. The study reveals that existing linear complexity language models
exhibit scaling capabilities similar to those of conventional transformer-based
models, while also demonstrating superior linguistic proficiency and knowledge
retention.
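
To make the architectural distinction in the abstract concrete, the sketch below is a minimal, single-head toy in PyTorch (not the authors' implementations, and not numerically equivalent across functions) contrasting quadratic softmax attention with a linear-attention recurrence whose decay is absent, data-independent, or data-dependent, loosely mirroring the cosFormer2 / TNL / HGRN2 settings named above.

```python
import torch

def softmax_attention(q, k, v):
    # Standard (non-causal, unmasked) attention: quadratic in sequence length
    # because the full (T, T) score matrix is materialized.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention_recurrent(q, k, v, decay=None):
    # Linear in sequence length: a (d, d) state is updated once per token.
    #   decay=None            -> no decay (loosely cosFormer2-like)
    #   decay: 0-dim tensor   -> data-independent decay (loosely TNL-like)
    #   decay: (T, d) tensor  -> data-dependent gate (loosely HGRN2-like)
    # No causal mask is applied in softmax_attention above, so the two
    # functions are not directly comparable; the point is the state/complexity
    # contrast only.
    T, d = q.shape
    state = torch.zeros(d, d)
    outputs = []
    for t in range(T):
        if decay is not None:
            if decay.dim() == 0:
                state = decay * state                   # same scalar every step
            else:
                state = decay[t].unsqueeze(-1) * state  # per-token, per-channel gate
        state = state + k[t].unsqueeze(-1) @ v[t].unsqueeze(0)  # rank-1 update k_t v_t^T
        outputs.append(q[t] @ state)                    # read out with the query
    return torch.stack(outputs)

# Toy usage with random activations.
T, d = 8, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out_softmax   = softmax_attention(q, k, v)
out_no_decay  = linear_attention_recurrent(q, k, v)
out_tnl_like  = linear_attention_recurrent(q, k, v, decay=torch.tensor(0.95))
out_hgrn_like = linear_attention_recurrent(q, k, v, decay=torch.sigmoid(torch.randn(T, d)))
```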
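For orientation only (the paper's fitted coefficients are not reproduced here), scaling-law studies of this kind typically model validation loss as a power law in parameter count N and training tokens D, for example the Chinchilla-style form

$$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, $$

where E, A, B, α, and β are constants fitted per architecture. Comparing such fits across TNL, HGRN2, cosFormer2, and the LLaMA baseline is what makes a claim like "similar scaling capabilities" quantitative.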