MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Abstract

Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs and ignores their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5-7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%-36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2-1.4× GPU memory reduction and boosts decode throughput by 5.5-6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
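To make the idea of per-head scaling rules concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that each head uses a causal sliding-window mask whose span grows linearly with input length (`span = alpha + beta * input_length`); the abstract does not specify the form of the rule, and the names `HeadRule`, `sliding_window_mask`, and `density`, as well as the two example heads, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadRule:
    """Hypothetical per-head rule: span = alpha + beta * input_length."""
    alpha: float  # fixed part of the attention span, in tokens
    beta: float   # span growth per input token (0 keeps the span fixed)

    def span(self, input_length: int) -> int:
        # Clamp the span to the valid range [1, input_length].
        raw = round(self.alpha + self.beta * input_length)
        return max(1, min(input_length, raw))


def sliding_window_mask(input_length: int, span: int) -> list[list[bool]]:
    """Causal mask in which query i attends to keys j with i - span < j <= i."""
    return [
        [i - span < j <= i for j in range(input_length)]
        for i in range(input_length)
    ]


def density(mask: list[list[bool]]) -> float:
    """Fraction of the causal (lower-triangular) positions that are kept."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) / 2)


# Two illustrative heads (assumed values): one stays local, one widens
# its window as the input grows.
rules = {
    "local_head": HeadRule(alpha=4, beta=0.0),
    "global_head": HeadRule(alpha=2, beta=0.5),
}

if __name__ == "__main__":
    for n in (8, 16):
        for name, rule in rules.items():
            s = rule.span(n)
            d = density(sliding_window_mask(n, s))
            print(f"N={n:2d} {name:11s} span={s:2d} density={d:.2f}")
```

Under these assumed rules, the local head keeps a span of 4 tokens at both input lengths, while the global head's span grows from 6 to 10 tokens. A uniform baseline would instead assign every head the same span; a mixture can match that average span while distributing it unevenly across heads.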
