EfficientLLM: Efficiency in Large Language Models
May 20, 2025
Authors: Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
cs.AI
Abstract
Large Language Models (LLMs) have driven significant progress, yet their
growing parameter counts and context windows incur prohibitive compute, energy,
and monetary costs. We introduce EfficientLLM, a novel benchmark and the first
comprehensive empirical study evaluating efficiency techniques for LLMs at
scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our
study systematically explores three key axes: (1) architecture pretraining
(efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts
(MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and
(3) inference (quantization methods: int4, float16). We define six fine-grained
metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy
Consumption, Compression Rate) to capture hardware saturation,
latency-throughput balance, and carbon cost. Evaluating over 100
model-technique pairs (0.5B-72B parameters), we derive three core insights: (i)
Efficiency involves quantifiable trade-offs: no single method is universally
optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by
40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5%
accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers the best
memory-latency trade-off for constrained devices, MLA achieves the lowest
perplexity for quality-critical tasks, and RSLoRA surpasses LoRA in efficiency
only beyond 14B parameters. (iii) Techniques generalize across modalities: we
extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and
Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By
open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM
provides essential guidance for researchers and engineers navigating the
efficiency-performance landscape of next-generation foundation models.
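
To make the fine-tuning axis above concrete, the sketch below shows a generic LoRA-style low-rank adapter in PyTorch. It is a minimal illustration under stated assumptions, not the EfficientLLM benchmark code: the LoRALinear wrapper name and the hyperparameters r=8 and alpha=16 are illustrative choices, and the RSLoRA and DoRA variants mentioned in the abstract differ mainly in the scaling rule and in adding a magnitude/direction decomposition of the base weight.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Only r * (in_features + out_features) parameters are trained,
        # instead of in_features * out_features for full fine-tuning.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r  # RSLoRA instead uses alpha / sqrt(r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scaling * x A^T B^T  (low-rank correction)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: adapt a 4096x4096 projection layer (sizes are hypothetical).
proj = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16.0)
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 4096 = 65536 trainable parameters vs ~16.8M for the full layer

In this toy configuration the adapter trains well under 1% of the layer's parameters, which is the mechanism behind the parameter-efficiency comparisons summarized in the abstract.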