
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

October 21, 2025
Authors: Song Bian, Tao Yu, Shivaram Venkataraman, Youngsuk Park
cs.AI

Abstract

Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, namely hidden size, the allocation of parameters between MLP and attention (the mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
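
The abstract does not specify the functional form of the conditional scaling law or the search procedure. As a rough illustration only, the sketch below pairs the standard Chinchilla form L(N, D) = E + A/N^α + B/D^β with an invented architecture-dependent correction over hidden size, mlp-to-attention ratio, and GQA grouping, plus a toy grid search that trades predicted loss against a crude throughput proxy. The coefficient values, the correction term, the throughput proxy, and the `Arch` fields are all assumptions made for this sketch, not the paper's fitted model or its actual search framework.

```python
# Minimal illustrative sketch (not the authors' implementation): a Chinchilla-style
# loss predictor extended with a hypothetical architecture-dependent term, and a toy
# grid search over architectures that trades predicted loss against a rough
# throughput proxy. All constants and functional forms are assumptions.

from dataclasses import dataclass
from itertools import product

# Assumed (illustrative) Chinchilla-style coefficients.
E, A, B, ALPHA, BETA = 1.7, 400.0, 410.0, 0.34, 0.28


@dataclass(frozen=True)
class Arch:
    hidden_size: int   # model width
    mlp_ratio: float   # mlp-to-attention parameter allocation ratio
    gqa_groups: int    # number of key/value groups in grouped-query attention


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Standard Chinchilla form: L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def conditional_loss(n_params: float, n_tokens: float, arch: Arch) -> float:
    """Hypothetical conditional scaling law: the base Chinchilla prediction plus a
    small architecture-dependent correction (a stand-in for the fitted term in the
    paper; this functional form is invented for illustration)."""
    ratio_term = 0.002 * abs(arch.mlp_ratio - 4.0)        # penalize extreme MLP/attention splits
    gqa_term = 0.001 * (8 / max(arch.gqa_groups, 1))       # penalize very aggressive KV sharing
    return chinchilla_loss(n_params, n_tokens) + ratio_term + gqa_term


def throughput_proxy(arch: Arch) -> float:
    """Crude decode-throughput proxy: fewer KV groups (more sharing) and a larger MLP
    share reduce attention/KV-cache cost per token. Illustrative only."""
    return arch.gqa_groups**-0.5 * (1.0 + 0.05 * arch.mlp_ratio) / (arch.hidden_size / 2048)


def search(n_params: float, n_tokens: float, max_loss_gap: float = 0.01) -> Arch:
    """Pick the highest-throughput architecture whose predicted loss is within
    max_loss_gap of the best predicted loss over a small grid."""
    grid = [Arch(h, r, g)
            for h, r, g in product([1024, 2048, 3072], [2.0, 4.0, 6.0], [1, 4, 8])]
    best_loss = min(conditional_loss(n_params, n_tokens, a) for a in grid)
    feasible = [a for a in grid
                if conditional_loss(n_params, n_tokens, a) <= best_loss + max_loss_gap]
    return max(feasible, key=throughput_proxy)


if __name__ == "__main__":
    # Example budget roughly matching the paper's upper range: ~3B params, ~100B tokens.
    print(search(3e9, 100e9))
```

The design choice mirrored here is the one the abstract describes: prediction (a scaling law conditioned on architecture) is separated from selection (maximizing an inference-efficiency objective subject to staying near the predicted accuracy optimum), so the same fitted law can be reused across different deployment constraints.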