Value Residual Learning For Alleviating Attention Concentration In Transformers

Abstract

Transformers can capture long-range dependencies using self-attention, allowing each token to attend directly to all others. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is cross-layer attention, which makes information from earlier layers directly accessible to later layers, but this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), in which all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods such as GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
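The "nearly 50%" KV cache figure for SVFormer can be checked with a small counting sketch. It is not taken from the paper's code. It assumes that keys and values have the same size in every layer, that a vanilla Transformer caches one key tensor and one value tensor per layer, and that SVFormer still caches keys per layer but keeps only the first layer's values. The function names kv_cache_tensors and cache_reduction are hypothetical and used only for illustration.

```python
def kv_cache_tensors(num_layers: int, share_first_layer_values: bool) -> int:
    """Count cached key/value tensors for one sequence (assumed equal size)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    keys = num_layers  # assumption: every layer keeps its own keys
    # Assumption: with sharing, only the first layer's values are stored.
    values = 1 if share_first_layer_values else num_layers
    return keys + values


def cache_reduction(num_layers: int) -> float:
    """Fraction of the vanilla KV cache saved by sharing first-layer values."""
    baseline = kv_cache_tensors(num_layers, share_first_layer_values=False)
    shared = kv_cache_tensors(num_layers, share_first_layer_values=True)
    return 1.0 - shared / baseline


if __name__ == "__main__":
    for n in (2, 12, 32):
        print(n, round(cache_reduction(n), 4))
```

Under these assumptions the saving is (L - 1) / (2L) for L layers: 31/64 (about 48%) for 32 layers, approaching but never reaching 50% as depth grows, which is consistent with the abstract's wording.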
