Cost-Optimal Grouped-Query Attention for Long-Context LLMs

Abstract

Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus: the goal is to maximize a model's language capabilities while minimizing training and deployment costs. Existing work has primarily characterized the relationships among model performance, parameter size, and data size, and has searched for the optimal allocation of compute for training LLMs. However, it overlooks the impact of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. We then extend existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs for both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. These findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
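To make the cost terms in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not the paper's cost model. It uses the standard estimates that the key-value cache grows with the number of key-value heads times the context length, and that the context-dependent attention compute grows with the number of query heads times the context length. The function names and the example configuration (32 layers, head dimension 128, 16-bit cache entries) are assumptions chosen for illustration only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def attention_flops_per_token(n_layers, n_q_heads, head_dim, seq_len):
    """Approximate context-dependent attention FLOPs to process one token.

    Counts only the query-key scores and the weighted sum over values;
    the linear projections and feed-forward layers are excluded.
    """
    # Each query head does two products over seq_len positions
    # (scores, then weighted values), at 2 FLOPs per multiply-add.
    return 4 * n_layers * n_q_heads * head_dim * seq_len


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    layers, head_dim, context = 32, 128, 131072
    for q_heads, kv_heads in [(32, 32), (32, 8), (16, 4)]:
        cache_gib = kv_cache_bytes(layers, kv_heads, head_dim, context) / 2**30
        flops = attention_flops_per_token(layers, q_heads, head_dim, context)
        print(f"q={q_heads:2d} kv={kv_heads:2d} "
              f"cache={cache_gib:5.1f} GiB attn_flops/token={flops:.2e}")
```

Under these assumptions both quantities are linear in the context length, so at long contexts the head configuration, rather than the parameter count, can dominate memory and compute. This is the regime in which the abstract's trade-off (a larger model with fewer heads) becomes relevant.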
