

Mitigating Label Length Bias in Large Language Models

November 18, 2025
Authors: Mario Sanz-Guerrero, Katharina von der Wense
cs.AI

Abstract

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
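To make the two ingredients the abstract names concrete, here is a minimal sketch of full-label length normalization combined with contextual calibration. The helper names, the toy log-probabilities, and the use of a content-free input (e.g., "N/A") to estimate the model's prior are illustrative assumptions, not the paper's exact NCC procedure.

```python
# Sketch (assumed details): score each candidate label by the mean per-token
# log-probability of the FULL label, then calibrate by subtracting the same
# length-normalized score under a content-free input, removing label priors.

def label_score(token_logprobs):
    """Length-normalized log-probability of a (possibly multi-token) label:
    the mean per-token log-prob, i.e., the log of the geometric mean."""
    return sum(token_logprobs) / len(token_logprobs)

def calibrated_score(prompt_logprobs, content_free_logprobs):
    """Prompt score minus content-free score (log-space division),
    so labels the model favors a priori are not over-predicted."""
    return label_score(prompt_logprobs) - label_score(content_free_logprobs)

# Toy per-token log-probs for two labels of different token lengths.
scores = {
    "positive": calibrated_score([-0.2], [-1.1]),
    "very negative": calibrated_score([-0.9, -0.8], [-0.7, -0.6]),
}
prediction = max(scores, key=scores.get)
```

Because both terms are averaged over the label's own token count, a two-token label and a one-token label are compared on the same per-token scale before calibration, which is the inconsistency the abstract refers to as label length bias.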