EfficientLLM: Efficiency in Large Language Models

May 20, 2025
Authors: Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
cs.AI

Abstract

Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
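
To make the attention-efficiency axis concrete, below is a minimal PyTorch sketch of grouped-query attention (GQA), in which several query heads share one key/value head; setting `n_kv_heads = 1` recovers MQA, and `n_kv_heads = n_heads` recovers standard multi-head attention. The class, its hyperparameters, and the usage example are illustrative assumptions, not the EfficientLLM implementation.

```python
# Minimal sketch of grouped-query attention (GQA). MQA is the special case
# n_kv_heads == 1; standard multi-head attention is n_kv_heads == n_heads.
# Illustrative only: names and shapes are assumptions, not the paper's code.
import torch
import torch.nn.functional as F
from torch import nn


class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0 and d_model % n_heads == 0
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        # K/V projections are smaller: fewer KV heads means a smaller KV cache,
        # which is the memory/latency saving the benchmark measures.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads attends with one shared KV head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


# Usage: an MQA-style layer, 8 query heads sharing a single KV head.
attn = GroupedQueryAttention(d_model=512, n_heads=8, n_kv_heads=1)
y = attn(torch.randn(2, 16, 512))
```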
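For the fine-tuning axis, the abstract contrasts LoRA with RSLoRA. As a reference, here is a hedged sketch of a LoRA linear layer with the rank-stabilized scaling (alpha/sqrt(r) instead of alpha/r) available as an option; layer names, initialization, and defaults are assumptions for illustration only.

```python
# Minimal sketch of a LoRA adapter on a linear layer, with optional
# rank-stabilized (RSLoRA) scaling. Illustrative assumptions throughout;
# this is not the EfficientLLM implementation.
import math
import torch
from torch import nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16,
                 rank_stabilized: bool = False):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # pretrained weights stay frozen
            p.requires_grad_(False)
        # A gets a small random init, B starts at zero, so the adapter is a
        # no-op before training; only A and B are updated.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        # LoRA scales the update by alpha/r; RSLoRA uses alpha/sqrt(r),
        # which keeps the update magnitude stable as the rank grows.
        self.scale = alpha / math.sqrt(r) if rank_stabilized else alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


# Usage: wrap one projection of a frozen model with a rank-8 adapter.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16, rank_stabilized=True)
out = layer(torch.randn(2, 16, 512))
```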
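For the inference axis, int4 weight quantization is typically applied at model-load time. The snippet below shows one common way to do this with Hugging Face `transformers` and `bitsandbytes` (NF4 weights, float16 compute); the model ID and quantization settings are assumptions for illustration, and the paper's exact int4 pipeline may differ.

```python
# One common way to load a causal LM with 4-bit (NF4) weight-only quantization.
# Illustrative only: model ID and settings are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"       # hypothetical choice for illustration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,   # dequantize to fp16 for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Efficiency is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```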
