Uncertainty Drives Social Bias Changes in Quantized Large Language Models
February 5, 2026
Authors: Stanley Z. Hua, Sanae Lotfi, Irene Y. Chen
cs.AI
Abstract
Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. We identify a phenomenon we term quantization-induced masked bias flipping, in which up to 21% of responses flip between biased and unbiased states after quantization despite no change in aggregate bias scores. These flips are strongly driven by model uncertainty: high-uncertainty responses are 3-11x more likely to change than confident ones. Quantization strength amplifies this effect, with 4-bit quantized models exhibiting 4-6x more behavioral changes than 8-bit quantized models. Critically, these changes create asymmetric impacts across demographic groups, with bias worsening by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, making post-quantization evaluation and intervention essential for reliable use in practice.
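To make the "masked flipping" effect concrete, here is a minimal illustrative sketch (not the paper's evaluation code; the binary bias labels and toy data are assumptions) showing how an aggregate bias score can remain unchanged while a substantial fraction of individual responses flip between biased and unbiased states after quantization.

```python
# Illustrative sketch: aggregate bias score vs. per-response flip rate.
# Assumption: each response is labeled 1 (biased) or 0 (unbiased) for the same
# prompts before and after quantization. This is not the paper's pipeline.
from typing import Sequence


def aggregate_bias(labels: Sequence[int]) -> float:
    """Aggregate bias score: share of responses judged biased."""
    return sum(labels) / len(labels)


def flip_rate(before: Sequence[int], after: Sequence[int]) -> float:
    """Fraction of prompts whose biased/unbiased label changes after quantization."""
    assert len(before) == len(after)
    return sum(b != a for b, a in zip(before, after)) / len(before)


if __name__ == "__main__":
    # Toy data: 3 responses become biased and 3 become unbiased, so the
    # aggregate score is identical even though 6/20 = 30% of responses flipped.
    before = [1, 1, 1, 0, 0, 0] + [1] * 4 + [0] * 10
    after  = [0, 0, 0, 1, 1, 1] + [1] * 4 + [0] * 10
    print(f"aggregate before: {aggregate_bias(before):.2f}")    # 0.35
    print(f"aggregate after:  {aggregate_bias(after):.2f}")     # 0.35
    print(f"flip rate:        {flip_rate(before, after):.2f}")  # 0.30
```

Because the flip rate counts disagreements in either direction, improvements for some prompts and regressions for others cancel in the aggregate score, which is exactly how group-level harms can hide behind a seemingly neutral benchmark number.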