Grouped Query Experts: A Mixture-of-Experts Model for GQA Self-Attention

Abstract

Self-attention is central to Transformer performance, and at long context lengths it is often the most expensive component because pairwise token interactions scale quadratically with sequence length. Standard dense attention also applies the same set of attention heads to every token, regardless of the token's difficulty or information content. This uniform activation can waste compute, particularly as sequences grow longer and attention cost rises. We propose Grouped Query Experts (GQE), which adds a mixture-of-experts layer on top of grouped-query attention (GQA). Within each GQA group, a router selects k query-head experts per token, while all key-value (KV) heads remain dense and unchanged. GQE therefore retains the KV-cache benefits of GQA and reduces only the active query-head computation. At the 250M-parameter scale with a fixed 30B-token budget, GQE matches the downstream accuracy of a GQA baseline that activates all query heads, while activating only half of the query heads per token.
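
The routing scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The function names, tensor shapes, the softmax gate over the selected heads, the causal mask, and the per-head output projection are all assumptions made for illustration, since the abstract specifies only that a router picks k query heads per token within each GQA group while the KV heads stay dense. For readability, the sketch computes every query head and zeroes the inactive ones, so it shows the routing semantics but not the compute savings, which would require gathering only the selected heads.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topk_mask(logits, k):
    """Boolean mask marking the k largest entries along the last axis."""
    idx = np.argsort(logits, axis=-1)[..., -k:]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def gqe_attention(x, Wq, Wk, Wv, Wr, Wo, k):
    """Hypothetical GQE layer for one sequence.

    x:  (T, d)        token representations
    Wq: (G, H, d, D)  H query heads in each of G groups
    Wk: (G, d, D)     one shared key head per group (dense)
    Wv: (G, d, D)     one shared value head per group (dense)
    Wr: (G, d, H)     per-group router (assumed to be linear)
    Wo: (G, H, D, d)  per-head output projection (assumed)
    """
    T, _ = x.shape
    D = Wq.shape[-1]

    # Router: within each group, every token keeps k of the H query heads.
    logits = np.einsum("td,gdh->gth", x, Wr)              # (G, T, H)
    active = topk_mask(logits, k)                         # (G, T, H)
    # Assumed gate: softmax over the selected heads; inactive heads get 0.
    gates = softmax(np.where(active, logits, -np.inf))    # (G, T, H)

    # Keys and values are computed for every token, exactly as in GQA.
    K = np.einsum("td,gde->gte", x, Wk)                   # (G, T, D)
    V = np.einsum("td,gde->gte", x, Wv)                   # (G, T, D)

    # Dense reference computation of all query heads (masked afterwards).
    Q = np.einsum("td,ghde->ghte", x, Wq)                 # (G, H, T, D)
    scores = np.einsum("ghte,gse->ghts", Q, K) / np.sqrt(D)
    causal = np.tril(np.ones((T, T), dtype=bool))
    attn = softmax(np.where(causal, scores, -np.inf))     # (G, H, T, T)
    heads = np.einsum("ghts,gse->ghte", attn, V)          # (G, H, T, D)

    # Apply the gate so that only the routed heads contribute.
    heads = heads * gates.transpose(0, 2, 1)[..., None]
    out = np.einsum("ghte,ghed->td", heads, Wo)           # (T, d)
    return out, active

# Example: 2 groups of 4 query heads, half of them active per token.
rng = np.random.default_rng(0)
T, d, G, H, D, k = 5, 16, 2, 4, 8, 2
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(G, H, d, D))
Wk = rng.normal(size=(G, d, D))
Wv = rng.normal(size=(G, d, D))
Wr = rng.normal(size=(G, d, H))
Wo = rng.normal(size=(G, H, D, d))
out, active = gqe_attention(x, Wq, Wk, Wv, Wr, Wo, k)
```

With `k = H // 2`, each token uses half of the query heads in every group, which corresponds to the setting reported in the abstract. `K` and `V` have the same shapes as in plain GQA, so the KV cache is unaffected by routing.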