UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

Abstract

Deploying large language models (LLMs) on mobile platforms is challenging because devices have limited memory and share their computational resources among workloads. Resource availability is also unstable, since it depends directly on the current device workload, which adds uncertainty to model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. Within this joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, a quantization-aware singular value decomposition (SVD) that minimizes quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. The framework performs weight sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates of up to 35%. Our experiments show that quantized and pruned models reduce memory use by 4x-5.7x and improve token throughput by 2.7x-3.4x, while maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
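To make the idea of sorting weights once in the cloud and choosing a pruning rate on the device more concrete, the following is a minimal sketch and not UniQL's implementation. It assumes a toy two-layer ReLU MLP, and it ranks hidden channels with a hypothetical importance proxy (the product of the matching row and column norms); the abstract does not state the criterion UniQL uses. All function names are illustrative.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mlp(w1, w2, x):
    """y = W2 @ relu(W1 @ x), with W1: hidden x d_in and W2: d_out x hidden."""
    hidden_act = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden_act)


def sort_hidden_channels(w1, w2):
    """Cloud side: reorder hidden channels from most to least important.

    Importance is a hypothetical proxy, ||W1[i, :]|| * ||W2[:, i]||.
    Permuting rows of W1 and columns of W2 together leaves the MLP unchanged.
    """
    def score(i):
        row_norm = math.sqrt(sum(w * w for w in w1[i]))
        col_norm = math.sqrt(sum(row[i] ** 2 for row in w2))
        return row_norm * col_norm

    order = sorted(range(len(w1)), key=score, reverse=True)
    w1_sorted = [w1[i] for i in order]
    w2_sorted = [[row[i] for i in order] for row in w2]
    return w1_sorted, w2_sorted


def prune(w1_sorted, w2_sorted, percent):
    """Device side: drop the trailing `percent`% of hidden channels by slicing."""
    hidden = len(w1_sorted)
    keep = hidden - hidden * percent // 100
    return w1_sorted[:keep], [row[:keep] for row in w2_sorted]


random.seed(0)
d_in, hidden, d_out = 4, 20, 3
w1 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(hidden)]
w2 = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

w1_s, w2_s = sort_hidden_channels(w1, w2)
y_full = mlp(w1, w2, x)
y_sorted = mlp(w1_s, w2_s, x)         # same function, channels reordered
w1_p, w2_p = prune(w1_s, w2_s, 35)    # 35% of hidden channels removed
y_pruned = mlp(w1_p, w2_p, x)
```

Because the channels are already ordered, changing the pruning rate at run time only changes the slice length, so no re-sorting or re-training step is needed on the device in this sketch.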
