The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

February 6, 2024
Authors: Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré
cs.AI

Abstract

Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as large language models into linear versions finetunable on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find that prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights that mimic softmax attention. Experiments show that Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs, and by up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion: converting a pretrained GPT-2 into a linear attention variant achieves a state-of-the-art 16.7 perplexity on WikiText-103 among 125M-parameter subquadratic decoder models. Finally, we turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves ROUGE-1 scores 28.1 points higher than the base standard-attention model, whereas prior linear attentions lead to drops of 16.5 points.
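To make the mechanism concrete, below is a minimal PyTorch sketch of linear attention with a learnable, Hedgehog-style feature map. It is an illustration based only on the abstract, not the authors' released implementation: the names (`HedgehogStyleFeatureMap`, `linear_attention`), the choice of a single bias-free linear projection with an elementwise exponential, the exp(z)/exp(-z) concatenation, the feature dimension, and the non-causal formulation are all assumptions, and the training step that teaches the MLP to mimic softmax attention weights is omitted.

```python
import torch
import torch.nn as nn


class HedgehogStyleFeatureMap(nn.Module):
    # Hypothetical feature map: a trainable linear projection followed by an
    # elementwise exponential, so phi(q) . phi(k) stays positive and can
    # imitate the spiky, monotone behavior of exp(q . k).
    def __init__(self, head_dim: int, feature_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)
        # Concatenate exp(z) and exp(-z): non-negative features that respond
        # in both dot-product directions.
        return torch.cat([z.exp(), (-z).exp()], dim=-1)


def linear_attention(q, k, v, feature_map, eps=1e-6):
    # Non-causal linear attention, O(n) in sequence length.
    # q, k, v: (batch, seq_len, head_dim)
    q, k = feature_map(q), feature_map(k)                  # (B, N, F)
    kv = torch.einsum("bnf,bnd->bfd", k, v)                # sum_n phi(k_n) v_n^T
    normalizer = torch.einsum("bnf,bf->bn", q, k.sum(dim=1)) + eps
    return torch.einsum("bnf,bfd->bnd", q, kv) / normalizer.unsqueeze(-1)


if __name__ == "__main__":
    B, N, D = 2, 128, 64
    q, k, v = (torch.randn(B, N, D) * 0.1 for _ in range(3))
    fmap = HedgehogStyleFeatureMap(head_dim=D, feature_dim=D)
    out = linear_attention(q, k, v, fmap)
    print(out.shape)  # torch.Size([2, 128, 64])
```

The design point the abstract emphasizes is visible here: because the features are strictly positive and grow exponentially with the projected dot product, the resulting attention weights can remain low-entropy ("spiky") and monotone in the query-key dot product, while the running sums `kv` and `k.sum(dim=1)` keep the cost linear rather than quadratic in sequence length.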