Selective Attention Improves Transformer

Abstract

Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism that reduces attention to unneeded elements. Selective attention improves language modeling performance across a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than those without it, at the same validation perplexity.
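The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way a parameter-free change could reduce attention to unneeded elements. It assumes (this is an assumption, not something stated above) that the logits of one existing head are reused as non-negative "selection" scores, that these scores are accumulated over positions, and that the accumulated value is subtracted from every head's logits before the softmax. The function name `selective_attention`, the choice of head 0, and the shapes are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def selective_attention(q, k, v):
    """Causal attention with an accumulated, parameter-free logit penalty.

    q, k, v: arrays of shape (heads, n, d).
    Returns (output, weights) with shapes (heads, n, d) and (heads, n, n).
    """
    h, n, d = q.shape
    logits = np.einsum("hid,hjd->hij", q, k) / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(causal, logits, -np.inf)

    # Assumption: head 0's logits double as selection scores, so no new
    # parameters are introduced. s[i, j] > 0 means "token i marks token j
    # as no longer needed".
    s = np.maximum(np.where(causal, logits[0], 0.0), 0.0)
    s[:, 0] = 0.0                        # never penalize the first token
    s[np.arange(n), np.arange(n)] = 0.0  # a token does not penalize itself

    # Shift down one row so a mark made by token i only affects later queries.
    s = np.vstack([np.zeros((1, n)), s[:-1]])

    # Accumulate the marks over positions and subtract from all heads' logits.
    penalty = np.cumsum(s, axis=0)
    weights = softmax(logits - penalty[None], axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(3, 2, 5, 4))  # 2 heads, 5 tokens, dim 4
    out, w = selective_attention(q, k, v)
    print(out.shape, w.shape)
```

In this sketch the penalty can only lower a logit, so a token that earlier positions have marked as unneeded receives less attention from later queries. Under the same assumption, tokens with a large accumulated penalty would be natural candidates for eviction from the context buffer, which is one way the memory reductions described above could arise.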
