Forgetting Transformer: Softmax Attention with a Forget Gate

Abstract

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
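The abstract describes the forget gate only as a data-dependent down-weighting of the unnormalized attention scores. The following is a minimal NumPy sketch of one way such a mechanism can be written, assuming a scalar gate value in (0, 1] per time step whose logarithms are accumulated into an additive bias on the attention logits. The function name, argument names, single-head layout, and the gate parameters `w_f` and `b_f` are illustrative assumptions and are not taken from the released code.

```python
import numpy as np


def forgetting_attention(q, k, v, log_f):
    """Causal softmax attention with a per-step forget gate (illustrative sketch).

    q, k, v: arrays of shape (T, d) for a single head.
    log_f:   array of shape (T,), the log of gate values in (0, 1].
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)            # unnormalized attention logits
    c = np.cumsum(log_f)                     # c[i] = sum of log f_l for l <= i
    decay = c[:, None] - c[None, :]          # decay[i, j] = sum of log f_l for j < l <= i
    causal = np.tril(np.ones((T, T), dtype=bool))
    logits = np.where(causal, scores + decay, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Usage with a data-dependent gate (hypothetical parameters w_f, b_f).
rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))                  # token representations
q, k, v = rng.normal(size=(3, T, d))
w_f, b_f = rng.normal(size=d), 2.0
z = x @ w_f + b_f
log_f = -np.logaddexp(0.0, -z)               # log(sigmoid(z)), computed stably
out = forgetting_attention(q, k, v, log_f)   # shape (T, d)
```

Because the gate enters as an additive bias on the logits before the softmax, a gate value of 1 everywhere (`log_f` all zeros) recovers ordinary causal softmax attention in this sketch, while gate values near 0 make each position attend almost only to itself.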
