Metis: Training Large Language Models with Advanced Low-Bit Quantization
August 30, 2025
Authors: Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
cs.AI
Abstract
This work identifies anisotropic parameter distributions as a fundamental
barrier to training large language models (LLMs) with low-bit quantization: a
few dominant singular values create wide numerical ranges that conflict with
the inherent bias of block-wise quantization. This bias disproportionately
preserves high-magnitude values while discarding smaller ones, causing training
instability and low model performance. This work introduces Metis, a training
framework that combines (i) spectral decomposition with random embedding to
efficiently disentangle dominant from long-tail components, compressing broad
distributions into quantization-friendly narrow ranges; (ii) adaptive learning
rates in the spectral domain to amplify underrepresented directions and better
capture diverse features critical for performance; and (iii) a dual-range
regularizer that jointly constrains numerical precision and parameter range
distribution, ensuring stable, unbiased low-bit training. With Metis, FP8
training surpasses FP32 baselines, and FP4 training achieves accuracy
comparable to FP32, paving the way for robust and scalable LLM training under
advanced low-bit quantization. The code implementation for Metis is available
at: https://github.com/typename-yyf/Metis-quantization.
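
Below is a minimal, self-contained sketch (in PyTorch, not the authors' released code) of the two ideas the abstract pairs together: block-wise quantization, whose shared per-block scale favors large magnitudes and rounds small entries toward zero, and a spectral split that peels off the few dominant singular directions so the remaining long-tail component falls into a narrow, quantization-friendly range. The helper names `fake_quant_blockwise` and `spectral_split` are illustrative only, and simple symmetric 4-bit rounding stands in for the FP4/FP8 formats used in the paper.

```python
# Sketch under stated assumptions: symmetric integer rounding as a proxy for
# low-bit floating-point formats; the dominant low-rank part is kept at full
# precision while only the long-tail residual is quantized.
import torch


def fake_quant_blockwise(w: torch.Tensor, bits: int = 4, block: int = 64) -> torch.Tensor:
    """Simulate block-wise quantization: each block shares one scale set by its
    largest magnitude, so small entries in a wide-range block round toward zero."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    qmax = 2 ** (bits - 1) - 1
    q = torch.round(flat / scale * qmax).clamp(-qmax, qmax)
    return (q / qmax * scale).reshape(w.shape)


def spectral_split(w: torch.Tensor, rank: int = 16):
    """Separate the dominant singular directions (wide range, kept unquantized here)
    from the long-tail residual (narrow range, safe to quantize)."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    dominant = (u[:, :rank] * s[:rank]) @ vh[:rank]
    residual = w - dominant
    return dominant, residual


if __name__ == "__main__":
    torch.manual_seed(0)
    # An anisotropic matrix: a few large singular values on top of small noise.
    w = torch.randn(256, 16) @ torch.randn(16, 256) + 0.01 * torch.randn(256, 256)

    direct = fake_quant_blockwise(w)            # quantize the full matrix as-is
    dom, res = spectral_split(w)
    split = dom + fake_quant_blockwise(res)     # quantize only the narrow-range residual

    print("direct 4-bit error:", (w - direct).norm().item())
    print("split  4-bit error:", (w - split).norm().item())
```

On such an anisotropic matrix the residual-only quantization error is typically far smaller than quantizing the full matrix directly, since the per-block scales are no longer inflated by the few dominant components; this is the range-compression effect the abstract attributes to the spectral decomposition step.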