Value Residual Learning For Alleviating Attention Concentration In Transformers

Abstract

Transformers can capture long-range dependencies using self-attention, which lets each token attend directly to all others. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is cross-layer attention, which makes information from earlier layers directly accessible to later layers, but this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), in which all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods such as GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
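
The two variants described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the first-layer values are combined with a layer's own values, so the weighted sum controlled by `lam` is an assumption, as are the causal mask, the single attention head, the omission of layer norm and MLP blocks, and the hypothetical names `attention_stack` and `kv_cache_fraction`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_stack(x, weights, mode="vanilla", lam=0.5):
    """Run a stack of single-head causal self-attention layers.

    mode = "vanilla":   every layer uses its own values
    mode = "resformer": every later layer mixes its values with the
                        first layer's values (assumed weighted sum)
    mode = "svformer":  every later layer reuses the first layer's values
    Returns the final hidden states and the list of value tensors a
    decoder would have to cache.
    """
    n, d = x.shape
    # Causal mask: -inf strictly above the diagonal, 0 elsewhere.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    v_first = None
    cached_values = []
    h = x
    for i, (wq, wk, wv) in enumerate(weights):
        q, k = h @ wq, h @ wk
        if i == 0:
            v = h @ wv
            v_first = v
            cached_values.append(v)
        elif mode == "svformer":
            v = v_first  # shared values: nothing new to cache
        else:
            v = h @ wv
            if mode == "resformer":
                # Assumed mixing rule for the value residual.
                v = lam * v + (1.0 - lam) * v_first
            cached_values.append(v)
        scores = q @ k.T / np.sqrt(d) + mask
        h = h + softmax(scores) @ v  # usual residual on hidden states
    return h, cached_values


def kv_cache_fraction(num_layers):
    # Vanilla caches one key and one value tensor per layer (2L tensors);
    # sharing a single value tensor leaves L keys plus 1 value.
    return (num_layers + 1) / (2 * num_layers)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
weights = [tuple(0.1 * rng.normal(size=(8, 8)) for _ in range(3))
           for _ in range(6)]

_, vanilla_values = attention_stack(x, weights, mode="vanilla")
_, shared_values = attention_stack(x, weights, mode="svformer")
print(len(vanilla_values), len(shared_values))  # 6 1
print(kv_cache_fraction(32))  # 0.515625
```

Under these assumptions, a 32-layer model with shared values keeps 33 of the 64 key and value tensors a vanilla model would cache, which is consistent with the "nearly 50%" reduction stated above. The fraction approaches one half as depth grows.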
