Cost-Optimal Grouped-Query Attention for Long-Context LLMs

March 12, 2025
Authors: Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, which requires maximizing model language capabilities while minimizing training and deployment costs. Existing efforts have primarily described the complex relationships among model performance, parameter size, and data size, and have searched for the optimal compute allocation for training LLMs. However, they overlook the impact of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. We then extend existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. These findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
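To make the cost trade-off described above concrete, the following is a minimal back-of-the-envelope sketch (not code or numbers from the paper) of how per-layer attention FLOPs and KV-cache memory scale with the grouped-query attention head configuration and context length. The function name `gqa_costs`, the standard dense-attention FLOP accounting, and the example sizes (d_model=4096, 32 query heads, head_dim=128, 128k context) are illustrative assumptions.

```python
# Illustrative sketch: rough per-layer cost model for grouped-query attention (GQA).
# All symbols and numbers are generic assumptions, not values from the paper.

def gqa_costs(d_model: int, n_q: int, n_kv: int, head_dim: int, seq_len: int,
              bytes_per_elem: int = 2) -> dict:
    """Estimate per-layer attention FLOPs and KV-cache memory for one sequence."""
    # Projections: Q uses n_q heads; K and V use n_kv heads shared across query groups.
    proj_flops = 2 * seq_len * d_model * (n_q + 2 * n_kv) * head_dim
    # Attention scores (QK^T) and the weighted sum over V scale with the *query* heads
    # and quadratically with sequence length.
    attn_flops = 2 * 2 * n_q * head_dim * seq_len * seq_len
    # The KV cache stores K and V for every position, but only for the n_kv heads,
    # so reducing n_kv shrinks inference memory roughly proportionally.
    kv_cache_bytes = 2 * seq_len * n_kv * head_dim * bytes_per_elem
    return {"proj_flops": proj_flops, "attn_flops": attn_flops,
            "kv_cache_bytes": kv_cache_bytes}

# Example: cutting the KV heads from 32 to 8 leaves the quadratic attention FLOPs
# unchanged but shrinks the per-layer KV cache by 4x at a 128k-token context.
print(gqa_costs(d_model=4096, n_q=32, n_kv=32, head_dim=128, seq_len=131072))
print(gqa_costs(d_model=4096, n_q=32, n_kv=8,  head_dim=128, seq_len=131072))
```

At long context lengths the quadratic attention term and the KV cache dominate, which is why a larger model with fewer attention heads can end up cheaper in both compute and memory, as the abstract argues.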
