Selective Attention Improves Transformer

Abstract

Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism that reduces attention to unneeded elements. Selective attention improves language modeling performance across a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than those without it, at the same validation perplexity.
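
The abstract says only that the change is parameter-free and reduces attention to unneeded elements; it does not spell out the mechanism. The sketch below shows one way such a change could look, under stated assumptions: a single causal attention head whose own logits are reused to build a non-negative penalty that is subtracted before the softmax. The helper names (`accumulated_penalty`, `causal_attention`) and the specific choices (ReLU on the logits, a one-step shift, a cumulative sum over query positions) are assumptions of this illustration, not details taken from the text above.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis; -inf entries become 0.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def accumulated_penalty(logits):
    # Hypothetical parameter-free penalty built from existing attention logits.
    # Entry (t, j) of `s` is how strongly token t "votes" that earlier token j
    # is no longer needed.
    s = np.maximum(logits, 0.0)   # only positive scores count as votes
    s = np.tril(s, k=-1)          # causal, and a token never votes against itself
    s = np.roll(s, 1, axis=0)     # a vote by token t applies to queries after t
    s[0, :] = 0.0                 # nothing precedes the first query
    return np.cumsum(s, axis=0)   # votes accumulate along the sequence

def causal_attention(q, k, v, use_penalty=False):
    # Single-head causal attention; optionally subtracts the penalty above.
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    if use_penalty:
        logits = logits - accumulated_penalty(logits)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    weights = softmax(np.where(future, -np.inf, logits))
    return weights @ v, weights

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out_std, w_std = causal_attention(q, k, v)
out_sel, w_sel = causal_attention(q, k, v, use_penalty=True)
```

Because the penalty is computed from logits the head already produces, the sketch adds no trainable parameters. The penalty on a token's own position is always zero, so every row of the softmax stays well defined.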
