EfficientLLM: Efficiency in Large Language Models

May 20, 2025
Authors: Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
cs.AI

Abstract

Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves the lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
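To make the fine-tuning and inference axes above concrete, the following is a minimal sketch (not the paper's actual pipeline) that loads a small causal LLM with int4 (NF4) weight quantization and attaches a LoRA adapter for parameter-efficient fine-tuning, using the Hugging Face transformers, peft, and bitsandbytes libraries. The model name, LoRA rank, and target modules are illustrative assumptions rather than the study's exact configuration.

```python
# Minimal sketch, not the paper's pipeline: int4 (NF4) quantized loading
# plus a LoRA adapter. Model name and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder; the study spans 0.5B-72B models

# int4 weight quantization with float16 compute: one point on the
# memory/energy-vs-accuracy trade-off discussed in the abstract.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Parameter-efficient fine-tuning via LoRA; rank/alpha/targets are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```

In recent peft versions, swapping in RSLoRA or DoRA is a one-flag change on the same LoraConfig (use_rslora=True or use_dora=True), which is what makes these methods straightforward to compare under identical metrics.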