Native Hybrid Attention for Efficient Sequence Modeling

Abstract

Transformers excel at sequence modeling but face quadratic complexity, while linear attention improves efficiency but often sacrifices recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra-layer and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
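To make the layer design described in the abstract more concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation (see the repository linked above for that). The class name `HybridAttentionSketch`, the round-robin slot choice, the fixed `decay` value, and the rule that a token is written into a slot when it leaves the sliding window are all assumptions made for illustration; the abstract only states that the slots are updated by a linear RNN. What the sketch does follow from the abstract is the overall shape: long-term key-value slots, a sliding window of recent tokens, and one softmax over the concatenation of both, with no separate fusion parameters.

```python
import math
from collections import deque


def attend(q, keys, vals):
    """One softmax attention of a single query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim_v = len(vals[0])
    return [sum(e / total * v[j] for e, v in zip(exps, vals)) for j in range(dim_v)]


class HybridAttentionSketch:
    """Illustrative only: slots + sliding window + a single softmax."""

    def __init__(self, dim, num_slots, window, decay=0.9):
        self.window = window  # sliding window size (the one hyperparameter)
        self.decay = decay    # hypothetical fixed decay; a real model would learn gates
        self.slot_k = [[0.0] * dim for _ in range(num_slots)]
        self.slot_v = [[0.0] * dim for _ in range(num_slots)]
        self.used = [False] * num_slots  # slots that have received a write
        self.writes = 0
        self.recent = deque()  # (key, value) pairs inside the window

    def _write_slot(self, k, v):
        # Assumed update rule: round-robin slot choice with a linear recurrence
        # slot <- decay * slot + (1 - decay) * new.
        i = self.writes % len(self.slot_k)
        if self.used[i]:
            a = self.decay
            self.slot_k[i] = [a * s + (1 - a) * x for s, x in zip(self.slot_k[i], k)]
            self.slot_v[i] = [a * s + (1 - a) * x for s, x in zip(self.slot_v[i], v)]
        else:
            self.slot_k[i], self.slot_v[i] = list(k), list(v)
            self.used[i] = True
        self.writes += 1

    def step(self, q, k, v):
        """Process one token and return its attention output."""
        self.recent.append((k, v))
        # Tokens that fall out of the window are folded into the slots (assumption).
        while len(self.recent) > self.window:
            self._write_slot(*self.recent.popleft())
        keys = [s for s, u in zip(self.slot_k, self.used) if u]
        vals = [s for s, u in zip(self.slot_v, self.used) if u]
        keys += [rk for rk, _ in self.recent]
        vals += [rv for _, rv in self.recent]
        # A single softmax over long-term slots and short-term tokens together.
        return attend(q, keys, vals)
```

In this sketch, `window=0` leaves only the recurrent slots, so memory stays constant in sequence length, while a window at least as long as the sequence never evicts a token and the output reduces to ordinary causal softmax attention. Intermediate values trade between the two without changing the structure of the layer, which is the role the abstract assigns to the sliding window size.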
