

Uncertainty Drives Social Bias Changes in Quantized Large Language Models

February 5, 2026
Authors: Stanley Z. Hua, Sanae Lotfi, Irene Y. Chen
cs.AI

Abstract

Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. We identify a phenomenon we term quantization-induced masked bias flipping, in which up to 21% of responses flip between biased and unbiased states after quantization despite no change in aggregate bias scores. These flips are strongly driven by model uncertainty: high-uncertainty responses are 3-11x more likely to change than confident ones. Quantization strength amplifies the effect, with 4-bit models exhibiting 4-6x more behavioral changes than 8-bit models. Critically, these changes affect demographic groups asymmetrically: bias can worsen by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, making post-quantization evaluation and intervention critical for reliable real-world deployment.
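
The masking effect is easy to see in a toy simulation. The sketch below is purely illustrative and is not the paper's code: the flip probabilities, uncertainty threshold, and group proportions are invented for the example. It shows how symmetric per-response flips leave the aggregate bias score essentially unchanged while a sizable fraction of individual responses change state, how flips concentrate in high-uncertainty responses, and how opposite group-specific flip directions can cancel in the pooled score.

```python
"""Toy simulation (hypothetical, not the paper's code) of how
per-response bias flips after quantization can hide behind flat
aggregate scores, and how uncertainty concentrates the flips."""
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Per-response bias labels before quantization (True = biased).
before = rng.integers(0, 2, size=n).astype(bool)

# Stand-in uncertainty score per response; high-uncertainty responses
# flip far more often (the paper reports a 3-11x gap; we use 10x).
uncertainty = rng.random(n)
flip_p = np.where(uncertainty > 0.7, 0.30, 0.03)
flipped = rng.random(n) < flip_p
after = before ^ flipped  # symmetric flips: aggregate barely moves

print(f"aggregate bias: {before.mean():.3f} -> {after.mean():.3f}")
print(f"overall flip rate: {flipped.mean():.3f}")
hi = flipped[uncertainty > 0.7].mean()
lo = flipped[uncertainty <= 0.7].mean()
print(f"flip rate, high vs low uncertainty: {hi:.3f} vs {lo:.3f} "
      f"({hi / lo:.1f}x)")

# Asymmetric group impacts can also cancel: flipped responses are
# resampled with group-specific odds of landing on 'biased', so one
# group drifts toward bias while the other drifts away.
group = rng.integers(0, 2, size=n)  # 0 = group A, 1 = group B
toward_bias = np.where(group == 0, 0.8, 0.2)
after_asym = np.where(flipped, rng.random(n) < toward_bias, before)
for g, name in [(0, "A"), (1, "B")]:
    m = group == g
    print(f"group {name}: bias {before[m].mean():.3f} -> "
          f"{after_asym[m].mean():.3f}")
print(f"pooled bias: {before.mean():.3f} -> {after_asym.mean():.3f}")
```

Running this prints a near-zero aggregate shift alongside a roughly 11% per-response flip rate, a roughly 10x gap between uncertainty buckets, and group-level drifts in opposite directions that vanish in the pooled score, mirroring the paper's qualitative pattern on synthetic data.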