Value Residual Learning For Alleviating Attention Concentration In Transformers
October 23, 2024
Authors: Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan
cs.AI
Abstract
Transformers can capture long-range dependencies using self-attention,
allowing tokens to attend to all others directly. However, stacking multiple
attention layers leads to attention concentration. One natural way to address
this issue is to use cross-layer attention, allowing information from earlier
layers to be directly accessible to later layers. However, this approach is
computationally expensive. To address this problem, we propose the Transformer
with residual value (ResFormer), which approximates cross-layer attention by
adding a residual connection from the values of the first layer to all
subsequent layers. Based on this method, one variant is the Transformer with
single layer value (SVFormer), where all layers share the same value embedding
from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical
evidence demonstrates that ResFormer mitigates the attention concentration problem
in deeper layers and enhances representation across most layers, outperforming
the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as
downstream tasks. SVFormer trains significantly faster than the vanilla
Transformer and performs better than other methods like GQA and CLA, with
performance influenced by sequence length and cumulative learning rate.
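
The value-sharing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head causal attention with no MLP, normalization, or output projection, and it assumes the first-layer values are combined with each later layer's values by a weighted sum with a hypothetical coefficient `lam` (the abstract does not say how they are combined). The names `forward`, `mode`, `lam`, and `kv_cache_reduction` are all illustrative.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        # Token i may only attend to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / total
                    for c in range(len(v[0]))])
    return out


def forward(x, layers, mode="vanilla", lam=0.5):
    """Run a stack of attention-only layers.

    mode="vanilla":   every layer uses its own values.
    mode="resformer": later layers add lam * (first-layer values) to their own
                      values (assumed combination rule, lam is hypothetical).
    mode="svformer":  later layers reuse the first-layer values unchanged.
    """
    h, v_first = x, None
    for w_q, w_k, w_v in layers:
        q, k = matmul(h, w_q), matmul(h, w_k)
        if v_first is None:
            v = v_first = matmul(h, w_v)  # first layer: remember its values
        elif mode == "svformer":
            v = v_first
        else:
            v = matmul(h, w_v)
            if mode == "resformer":
                v = [[a + lam * b for a, b in zip(r, r1)] for r, r1 in zip(v, v_first)]
        attn = causal_attention(q, k, v)
        h = [[a + b for a, b in zip(r, ra)] for r, ra in zip(h, attn)]  # hidden residual
    return h


def kv_cache_reduction(num_layers):
    """Fraction of cached tensors saved when only layer 1 stores values.

    Assumes keys and values have equal size in every layer: a vanilla model
    caches 2 * L tensors, a shared-value model caches L keys plus 1 value.
    """
    return 1.0 - (num_layers + 1) / (2 * num_layers)


def rand_matrix(rng, rows, cols, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = rand_matrix(rng, 4, 8, 1.0)  # 4 tokens, model width 8
layers = [tuple(rand_matrix(rng, 8, 8, 0.1) for _ in range(3)) for _ in range(3)]
out_res = forward(x, layers, mode="resformer")
out_sv = forward(x, layers, mode="svformer")
print(len(out_res), len(out_res[0]), round(kv_cache_reduction(24), 4))
```

Under the equal-size assumption, the saving is (L - 1) / (2L) for L layers, which approaches but never reaches one half; this is consistent with the "nearly 50%" figure in the abstract.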