Differential Transformer
October 7, 2024
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
cs.AI
Abstract
Transformer tends to overallocate attention to irrelevant context. In this
work, we introduce Diff Transformer, which amplifies attention to the relevant
context while canceling noise. Specifically, the differential attention
mechanism calculates attention scores as the difference between two separate
softmax attention maps. The subtraction cancels noise, promoting the emergence
of sparse attention patterns. Experimental results on language modeling show
that Diff Transformer outperforms Transformer in various settings of scaling up
model size and training tokens. More intriguingly, it offers notable advantages
in practical applications, such as long-context modeling, key information
retrieval, hallucination mitigation, in-context learning, and reduction of
activation outliers. By being less distracted by irrelevant context, Diff
Transformer can mitigate hallucination in question answering and text
summarization. For in-context learning, Diff Transformer not only enhances
accuracy but is also more robust to order permutation, which was considered as
a chronic robustness issue. The results position Diff Transformer as a highly
effective and promising architecture to advance large language models.
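
For readers who want a concrete picture of the mechanism described in the abstract, below is a minimal sketch of differential attention for a single head in PyTorch. The projection layout, the learnable scalar `lmbda`, and the tensor shapes are illustrative assumptions chosen for clarity; they are not the authors' reference implementation (consult the paper and its official code for the exact formulation, including multi-head details and the parameterization of the subtraction weight).

```python
# Minimal sketch of differential attention (single head, batch-first).
# Assumptions: queries/keys are split into two groups, and a learnable
# scalar `lmbda` reweights the second softmax map; these details are
# illustrative, not the paper's exact parameterization.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Two query/key projections produce two separate attention maps.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        # Learnable weight on the second map (initial value is arbitrary here).
        self.lmbda = nn.Parameter(torch.tensor(0.5))
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        # Two independent softmax attention maps over the same sequence.
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Differential attention: subtracting the two maps cancels scores
        # they assign in common (noise), yielding sparser effective attention.
        return (a1 - self.lmbda * a2) @ v

# Usage example (shapes are arbitrary):
#   x = torch.randn(2, 16, 64)
#   out = DiffAttention(d_model=64, d_head=32)(x)  # -> (2, 16, 32)
```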