EfficientLLM: Efficiency in Large Language Models
May 20, 2025
Authors: Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
cs.AI
Abstract
Large Language Models (LLMs) have driven significant progress, yet their
growing parameter counts and context windows incur prohibitive compute, energy,
and monetary costs. We introduce EfficientLLM, a novel benchmark and the first
comprehensive empirical study evaluating efficiency techniques for LLMs at
scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our
study systematically explores three key axes: (1) architecture pretraining
(efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts
(MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and
(3) inference (quantization methods: int4, float16). We define six fine-grained
metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy
Consumption, Compression Rate) to capture hardware saturation,
latency-throughput balance, and carbon cost. Evaluating over 100
model-technique pairs (0.5B-72B parameters), we derive three core insights: (i)
Efficiency involves quantifiable trade-offs: no single method is universally
optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by
40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5%
accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers the best
memory-latency trade-off for constrained devices, MLA achieves the lowest
perplexity for quality-critical tasks, and RSLoRA surpasses LoRA in efficiency
only beyond 14B parameters. (iii) Techniques generalize across modalities: we
extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and
Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By
open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM
provides essential guidance for researchers and engineers navigating the
efficiency-performance landscape of next-generation foundation models.
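
To make the fine-tuning axis above concrete, the sketch below shows a generic LoRA-style low-rank adapter in PyTorch. It is a minimal illustration under stated assumptions, not the EfficientLLM benchmark code: the LoRALinear wrapper name and the hyperparameters r=8 and alpha=16 are illustrative choices, and the RSLoRA and DoRA variants mentioned in the abstract differ mainly in the scaling rule and in adding a magnitude/direction decomposition of the base weight.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Only r * (in_features + out_features) parameters are trained,
        # instead of in_features * out_features for full fine-tuning.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r  # RSLoRA instead uses alpha / sqrt(r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scaling * x A^T B^T  (low-rank correction)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: adapt a 4096x4096 projection layer (sizes are hypothetical).
proj = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16.0)
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 4096 = 65536 trainable parameters vs ~16.8M for the full layer

In this toy configuration the adapter trains well under 1% of the layer's parameters, which is the mechanism behind the parameter-efficiency comparisons summarized in the abstract.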