
LongNet: Scaling Transformers to 1,000,000,000 Tokens

July 5, 2023
Authors: Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Furu Wei
cs.AI

Abstract

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
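
To make the core idea of dilated attention concrete, here is a minimal, hypothetical PyTorch sketch: the sequence is split into segments, only every r-th token inside each segment participates in attention, and the outputs from several (segment length, dilation rate) pairs are averaged. The function name, the default segment lengths, and the dilation rates below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of dilated attention (not the official LongNet code).
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_lengths=(64, 128, 256), dilation_rates=(1, 2, 4)):
    """Approximate dilated attention over (B, N, D) tensors.

    For each (w, r) pair: split the sequence into segments of length w,
    keep every r-th token inside each segment, run standard attention on
    the sparsified segments, scatter results back, and average across pairs.
    Assumes the sequence length N is divisible by every segment length w.
    """
    B, N, D = q.shape
    out = torch.zeros_like(q)
    for w, r in zip(segment_lengths, dilation_rates):
        # Reshape into N // w segments of length w, then keep every r-th token.
        qs = q.view(B, N // w, w, D)[:, :, ::r, :]
        ks = k.view(B, N // w, w, D)[:, :, ::r, :]
        vs = v.view(B, N // w, w, D)[:, :, ::r, :]
        # Standard scaled dot-product attention inside each sparsified segment.
        attn = F.softmax(qs @ ks.transpose(-2, -1) / D ** 0.5, dim=-1)
        seg_out = attn @ vs  # (B, N // w, w // r, D)
        # Scatter the sparse outputs back to their original positions.
        dense = torch.zeros(B, N // w, w, D, dtype=q.dtype, device=q.device)
        dense[:, :, ::r, :] = seg_out
        out += dense.view(B, N, D)
    return out / len(segment_lengths)

# Example usage with random inputs:
# q = k = v = torch.randn(1, 512, 32)
# y = dilated_attention(q, k, v)  # y.shape == (1, 512, 32)
```

Under this sketch, each segment costs on the order of (w / r)^2 and there are N / w segments, so when w and r grow geometrically together the total cost stays linear in N, which is consistent with the complexity claim in the abstract.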