A Systematic Analysis of Hybrid Linear Attention

Abstract

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms that use fixed-size hidden states. However, linear models often suffer from limited recall performance, which has led to hybrid architectures that combine linear and full attention layers. Despite extensive research on hybrid architectures, the choice of the linear attention component has not been deeply explored. We systematically evaluate linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling performance remains stable across linear-to-full attention ratios, recall improves significantly as the number of full attention layers increases, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
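
To make the linear-to-full ratio concrete, the following is a minimal sketch of how a layer stack could be laid out for a given ratio. It assumes a simple periodic interleaving (one full attention layer after every N linear attention layers), which is an illustrative choice rather than the placement used in the paper; the function name `hybrid_layer_schedule` and the 24-layer depth are hypothetical.

```python
from typing import List


def hybrid_layer_schedule(num_layers: int, linear_per_full: int) -> List[str]:
    """Return the layer type ("linear" or "full") for each layer in the stack.

    Assumption (not from the paper): full attention layers are placed
    periodically, one after every `linear_per_full` linear attention layers.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if linear_per_full < 0:
        raise ValueError("linear_per_full must be non-negative")

    period = linear_per_full + 1  # N linear layers followed by 1 full layer
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Hypothetical 24-layer model at a 3:1 linear-to-full ratio.
    schedule = hybrid_layer_schedule(num_layers=24, linear_per_full=3)
    print(schedule)
    print("linear:", schedule.count("linear"), "full:", schedule.count("full"))
```

Under this layout, a 3:1 ratio on 24 layers yields 18 linear and 6 full attention layers. When the depth is not a multiple of the period, the realized ratio only approximates the nominal one.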
