UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
December 3, 2025
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
cs.AI
Abstract
Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of these devices. Because resource availability is directly affected by the current device workload, model deployment carries additional uncertainty. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. Within this joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization error, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates of up to 35%. Our experiments show that quantized and pruned models achieve a 4x-5.7x memory reduction and a 2.7x-3.4x token-throughput improvement while maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
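To make the combination of low-rank compression, post-training quantization, and an on-device configurable pruning rate concrete, the following is a minimal NumPy sketch, not the UniQL implementation. It uses plain SVD rather than the paper's quantization-aware SVD, omits structured weight sorting, SSM state-aware handling, and the fused RoPE kernel, and the function names (quantize_int8, compress_weight, apply_weight) and the prune_rate parameter are illustrative assumptions.

```python
# Minimal sketch (assumed names, not the authors' code): factorize a weight
# matrix with SVD, quantize the low-rank factors offline, and pick the rank
# kept at load time from a configurable pruning rate.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns int8 values and a scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def compress_weight(W, prune_rate=0.15):
    """Factorize W with SVD, drop a fraction of singular components,
    and quantize the remaining low-rank factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    keep = int(np.ceil(len(S) * (1.0 - prune_rate)))  # configurable on-device
    A = U[:, :keep] * S[:keep]   # (out_dim, keep)
    B = Vt[:keep, :]             # (keep, in_dim)
    return quantize_int8(A), quantize_int8(B)

def apply_weight(x, packed):
    """Dequantize the factors and apply the compressed weight to activations x."""
    (qA, sA), (qB, sB) = packed
    A = qA.astype(np.float32) * sA
    B = qB.astype(np.float32) * sB
    return x @ B.T @ A.T         # approximates x @ W.T at reduced rank

if __name__ == "__main__":
    W = np.random.randn(256, 512).astype(np.float32)
    x = np.random.randn(4, 512).astype(np.float32)
    packed = compress_weight(W, prune_rate=0.15)
    print(apply_weight(x, packed).shape)  # (4, 256)
```

In this sketch the pruning rate only selects how many singular components are kept when the factors are loaded, which mirrors the idea of serving multiple on-device configurations from a single cloud-compressed checkpoint; the paper's contribution is in making that truncation accurate after quantization and weight sorting.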