
Differential Transformer

October 7, 2024
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
cs.AI

Abstract

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has been considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
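The abstract describes the mechanism only at a high level: attention scores are formed as the difference between two independent softmax attention maps, so that noise common to both maps cancels out. Below is a minimal PyTorch sketch of that idea. The class name DifferentialAttention, the single-head layout, and the single learnable scalar lam are illustrative assumptions for this sketch, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class DifferentialAttention(nn.Module):
    """Minimal sketch of differential attention: the score map is the
    difference of two softmax attention maps, which cancels common-mode
    noise and encourages sparse attention patterns."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two independent query/key projections produce the two attention maps.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        # Learnable weight on the subtracted map (a simplification of the
        # paper's lambda; assumed scalar here for illustration).
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        # Two separate softmax attention maps over the same sequence.
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Differential attention: subtract the second map from the first.
        return (a1 - self.lam * a2) @ v


# Usage: a random batch of 4 sequences of length 16.
attn = DifferentialAttention(d_model=64, d_head=32)
out = attn(torch.randn(4, 16, 64))
print(out.shape)  # torch.Size([4, 16, 32])
```

Subtracting the second map removes attention mass that both maps place on the same positions, which is the noise-cancellation effect the abstract attributes to the differential attention mechanism.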
