Grouped Query Experts: A Mixture-of-Experts Model Built on GQA Self-Attention

Abstract

Self-attention is central to Transformer performance, and at long context lengths it is often the most expensive part of the model, because its pairwise token interactions scale quadratically with sequence length. Standard dense attention also applies the same set of attention heads to every token, regardless of the token's difficulty or information content. This uniform activation can waste compute, especially as sequences grow longer and attention cost rises rapidly. We propose Grouped Query Experts (GQE), a mixture-of-experts layer built on grouped-query attention (GQA). Within each GQA group, a router selects k query-head experts per token, while all key-value (KV) heads remain dense and unchanged. GQE therefore keeps the KV-cache benefits of GQA and reduces only the active query-head computation. At the 250M-parameter scale with a fixed 30B-token budget, GQE matches the all-active GQA baseline in downstream accuracy while activating only half of the query heads per token.
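
The routing idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear router `Wr`, the softmax gating with renormalization over the selected heads, the causal mask, and the name `gqe_attention` are all assumptions made for illustration. For readability the sketch computes every query head densely and then zeroes the inactive ones; an efficient implementation would presumably compute only the k selected heads per token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqe_attention(x, Wq, Wk, Wv, Wr, k):
    """Hypothetical GQE-style layer (illustrative only).

    x:      (T, d_model)            token representations
    Wq:     (G, H, d_model, d_head) H query heads in each of G groups
    Wk, Wv: (G, d_model, d_head)    one dense KV head per group
    Wr:     (G, d_model, H)         assumed linear router per group
    k:      query heads activated per token in each group
    Returns per-head outputs (T, G, H, d_head) and the routing mask (T, G, H).
    """
    T = x.shape[0]
    d_head = Wq.shape[-1]

    # Router: score the H query heads of each group for every token.
    probs = softmax(np.einsum("td,gdh->tgh", x, Wr), axis=-1)
    topk = np.argsort(-probs, axis=-1)[..., :k]           # (T, G, k)
    mask = np.zeros(probs.shape, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=-1)
    # Assumed gating: renormalize router probabilities over the chosen heads.
    gates = np.where(mask, probs, 0.0)
    gates = gates / gates.sum(axis=-1, keepdims=True)

    # Keys and values stay dense: every token contributes to every KV head.
    K = np.einsum("td,gde->gte", x, Wk)                   # (G, T, d_head)
    V = np.einsum("td,gde->gte", x, Wv)

    # Queries for all heads (dense here only for clarity of the sketch).
    Q = np.einsum("td,ghde->ghte", x, Wq)                 # (G, H, T, d_head)
    scores = np.einsum("ghte,gse->ghts", Q, K) / np.sqrt(d_head)
    causal = np.tril(np.ones((T, T), dtype=bool))         # assumed causal LM
    attn = softmax(np.where(causal, scores, -np.inf), axis=-1)
    out = np.einsum("ghts,gse->ghte", attn, V)            # (G, H, T, d_head)

    # Keep only the heads each token was routed to, weighted by its gate.
    out = out.transpose(2, 0, 1, 3) * gates[..., None]    # (T, G, H, d_head)
    return out, mask
```

With, say, H = 4 query heads per group and k = 2, each token activates half of the query heads in every group, which corresponds to the setting reported in the abstract, while the shapes of `K` and `V` (and hence the KV cache) are the same as in plain GQA.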