Differential Transformer

Abstract

The Transformer tends to over-allocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer across various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. Because it is less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has long been considered a chronic robustness issue. These results position Diff Transformer as a highly effective and promising architecture for advancing large language models.
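To make the central idea concrete, the following is a minimal sketch of attention scores computed as the difference between two softmax attention maps. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy inputs, and the scalar weight `lam` on the second map are all hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot-product scores."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=1.0):
    """Return (outputs, scores) with scores = map1 - lam * map2.

    `lam` is a hypothetical scalar weight on the second map; it is an
    assumption for illustration and is not specified in the abstract.
    """
    map1 = attention_map(q1, k1)
    map2 = attention_map(q2, k2)
    # Subtracting the second map removes attention mass that both maps share.
    scores = [
        [a - lam * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(map1, map2)
    ]
    # Weighted sum of value vectors, one output vector per query.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in scores
    ]
    return outputs, scores


# Toy inputs (made up for illustration): 3 tokens, dimension 2.
q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k1 = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, scores = differential_attention(q1, k1, q2, k2, values, lam=1.0)
for row in scores:
    print([round(s, 3) for s in row])
```

With `lam=1.0`, each row of `scores` sums to zero, so individual scores can be negative, unlike a single softmax map. If both maps are identical, the difference vanishes entirely, which is the sense in which attention common to both maps is canceled.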

