E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

Abstract

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Combined, quantization and model merging yield an efficient low-resource deployment pipeline that integrates multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses the source expert weights to provide expert-guided output targets during layer-wise calibration, and applies merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. In harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5-base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.
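The two coupled deviations named above can be illustrated on a single linear layer. The following is a minimal sketch under stated assumptions, not the E-PMQ procedure itself: the weights and calibration activations are random toy data, the merging coefficient `lam` and the helper names (`task_arithmetic`, `quantize_rtn`, `rel_err`) are hypothetical, and plain round-to-nearest quantization stands in for GPTQ. It measures the layer-output deviation of the 4-bit weights relative to the merged weights (quantization deviation) and of the merged weights relative to each expert (expert-relative merging deviation).

```python
import numpy as np

def task_arithmetic(w_pre, experts, lam=0.3):
    """Task Arithmetic merge: pretrained weights plus the scaled sum of task vectors."""
    return w_pre + lam * sum(w - w_pre for w in experts)

def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (a simple stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def rel_err(a, b):
    """Relative Frobenius-norm error of a with respect to the reference b."""
    return float(np.linalg.norm(a - b) / np.linalg.norm(b))

# Toy data (assumption): one linear layer, three experts fine-tuned from w_pre.
rng = np.random.default_rng(0)
d_out, d_in, n = 16, 32, 64
w_pre = rng.normal(size=(d_out, d_in))
experts = [w_pre + 0.1 * rng.normal(size=(d_out, d_in)) for _ in range(3)]
x = rng.normal(size=(d_in, n))  # calibration activations

w_merged = task_arithmetic(w_pre, experts)
w_quant = quantize_rtn(w_merged)

# Quantization deviation: quantized output vs. merged output.
quant_dev = rel_err(w_quant @ x, w_merged @ x)
# Expert-relative merging deviation: merged output vs. each expert's output.
merge_dev = [rel_err(w_merged @ x, w @ x) for w in experts]
# What the deployed low-bit model actually incurs relative to each expert.
total_dev = [rel_err(w_quant @ x, w @ x) for w in experts]

print(f"quantization deviation: {quant_dev:.4f}")
print("merging deviation per expert:", [f"{d:.4f}" for d in merge_dev])
print("combined deviation per expert:", [f"{d:.4f}" for d in total_dev])
```

A calibration that targets only the merged output can reduce the first quantity at best; the combined deviation relative to each expert also includes the second, which is why the abstract treats them as coupled.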