UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
December 3, 2025
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
cs.AI
Abstract
Deploying large language models (LLMs) on mobile platforms faces significant challenges due to devices' limited memory and shared computational resources. Resource availability is directly affected by the device's current workload, which adds uncertainty to model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. Within this joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates of up to 35%. Our experiments show that quantized and pruned models achieve a 4x-5.7x memory reduction and a 2.7x-3.4x token-throughput improvement, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
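To make the general idea of jointly using quantization and low-rank compression concrete, the following is a minimal, self-contained sketch of one common way to combine the two: quantize a weight matrix with plain round-to-nearest 4-bit quantization, then fit a truncated-SVD low-rank correction to the quantization residual. This is an illustrative assumption, not UniQL's quantization-aware SVD, weight sorting, or pruning method; all function names and parameters below are hypothetical.

```python
# Conceptual sketch only (not the UniQL algorithm): quantize weights, then
# repair part of the quantization error with a low-rank SVD correction.
import numpy as np

def quantize_int4_rtn(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest 4-bit quantization, dequantized back to float."""
    scale = np.abs(w).max() / 7.0          # symmetric int4 levels in [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7)
    return q * scale

def lowrank_correction(residual: np.ndarray, rank: int):
    """Best rank-`rank` approximation of the residual via truncated SVD."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]   # A @ B approximates the residual

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

w_q = quantize_int4_rtn(w)                    # quantized backbone weights
a, b = lowrank_correction(w - w_q, rank=32)   # low-rank repair of the quantization error

err_q = np.linalg.norm(w - w_q) / np.linalg.norm(w)
err_joint = np.linalg.norm(w - (w_q + a @ b)) / np.linalg.norm(w)
print(f"relative error, quantization only:              {err_q:.4f}")
print(f"relative error, quantization + rank-32 low-rank: {err_joint:.4f}")
```

In this toy setup, the rank of the correction plays a role analogous to a configurable compression knob: a smaller rank saves more memory at the cost of a larger reconstruction error, whereas UniQL additionally supports on-device configurable structured pruning on top of its cloud-side sorting, fine-tuning, and quantization pipeline.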