E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

Abstract

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert-guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5-base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.
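
The two coupled deviations can be illustrated on a single linear layer. The sketch below is a toy under stated assumptions, not the E-PMQ procedure, whose calibration objective is not given in this abstract. It merges random "expert" weights with Task Arithmetic, quantizes the merged weights with plain per-row round-to-nearest (a simple stand-in for GPTQ), and measures the merging deviation, the quantization deviation, and the total deviation from each expert's layer output. All function names, shapes, and the scaling coefficient `lam=0.3` are hypothetical choices made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fro(a):
    """Frobenius norm of a matrix."""
    return sum(x * x for row in a for x in row) ** 0.5

def task_arithmetic(pre, experts, lam):
    """Task Arithmetic merge: W_merged = W_pre + lam * sum_i (W_i - W_pre)."""
    merged = [row[:] for row in pre]
    for w in experts:
        for r, (row_w, row_p) in enumerate(zip(w, pre)):
            for c, (x, p) in enumerate(zip(row_w, row_p)):
                merged[r][c] += lam * (x - p)
    return merged

def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (assumed stand-in, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in w:
        scale = max(abs(x) for x in row) / qmax or 1.0  # guard against all-zero rows
        out.append([max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in row])
    return out

rng = random.Random(0)
d_out, d_in, n_samples = 4, 8, 16

# Hypothetical pretrained layer, three fine-tuned experts, and calibration inputs
# (each column of x is one calibration sample).
pre = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
experts = [[[p + rng.gauss(0, 0.1) for p in row] for row in pre] for _ in range(3)]
x = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(d_in)]

merged = task_arithmetic(pre, experts, lam=0.3)
quant = quantize_rtn(merged, bits=4)

merged_out = matmul(merged, x)
quant_out = matmul(quant, x)
quant_dev = fro(sub(quant_out, merged_out))  # deviation from low-bit reconstruction

deviations = []
for i, w_i in enumerate(experts):
    expert_out = matmul(w_i, x)                 # expert output as a reference target
    merge_dev = fro(sub(merged_out, expert_out))  # expert-relative merging deviation
    total_dev = fro(sub(quant_out, expert_out))   # what the deployed model exhibits
    deviations.append((merge_dev, quant_dev, total_dev))
    print(f"expert {i}: merge={merge_dev:.3f} quant={quant_dev:.3f} total={total_dev:.3f}")
```

By the triangle inequality, the total deviation from an expert's output is at most the sum of the merging and quantization deviations. A calibration that targets only the merged model's outputs reduces the quantization term alone, which is one way to read the motivation for using expert outputs as calibration targets.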