
Forgetting Transformer: Softmax Attention with a Forget Gate

March 3, 2025
Authors: Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville
cs.AI

Abstract

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
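The abstract describes Forgetting Attention as down-weighting the unnormalized attention scores in a data-dependent way. The sketch below illustrates one way this can be realized: per-timestep forget gates in (0, 1) whose accumulated log values are added as a causal decay bias to the attention logits before the softmax. This is a minimal, illustrative sketch, not the paper's implementation; the function name `forgetting_attention`, the single-head tensor shapes, and passing pre-sigmoid gate logits as an input are assumptions made here for brevity.

```python
# Minimal sketch of a forget-gated softmax attention (single head, single sequence).
# Assumed shapes: q, k, v are (T, d); f_logits is (T,) with one gate logit per timestep.
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, f_logits):
    T, d = q.shape
    # Forget gates f_t = sigmoid(f_logits_t) in (0, 1); work in log space for stability.
    log_f = F.logsigmoid(f_logits)                    # (T,)
    cum = torch.cumsum(log_f, dim=0)                  # (T,)
    # bias[i, j] = sum_{l=j+1..i} log f_l, the data-dependent down-weighting
    # applied to the unnormalized score between query i and key j.
    bias = cum.unsqueeze(1) - cum.unsqueeze(0)        # (T, T)
    scores = (q @ k.T) / d**0.5 + bias
    # Causal mask: query i attends only to keys j <= i.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the down-weighting enters only as an additive bias on the logits, the softmax normalization and causal structure of standard attention are unchanged, which is consistent with the abstract's claim that the mechanism remains compatible with FlashAttention-style kernels and needs no positional embeddings.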
