

Mitigating Label Length Bias in Large Language Models

November 18, 2025
Authors: Mario Sanz-Guerrero, Katharina von der Wense
cs.AI

Abstract

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
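The abstract describes NCC as normalizing and calibrating at the full-label level: score each complete (possibly multi-token) label, length-normalize that score, then correct it against the model's prior preference for each label as measured on a content-free input. The sketch below illustrates that general recipe with hypothetical inputs; the function name, the per-token averaging, and the subtraction-based calibration are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def ncc_scores(label_logprobs, label_lengths, cf_logprobs, cf_lengths):
    """Illustrative sketch of full-label normalization and calibration.

    label_logprobs: total log-probability of each full candidate label
        given the real prompt.
    cf_logprobs: the same labels scored on a content-free prompt
        (e.g. the input replaced by "N/A"), capturing the model's prior.
    *_lengths: token count of each label, for length normalization.
    """
    # Length-normalize at the full-label level: average log-prob per token,
    # so longer labels are not penalized simply for having more tokens.
    norm = np.asarray(label_logprobs, float) / np.asarray(label_lengths, float)
    cf_norm = np.asarray(cf_logprobs, float) / np.asarray(cf_lengths, float)

    # Calibrate: subtract the content-free baseline so labels the model
    # favors a priori are not systematically over-predicted.
    calibrated = norm - cf_norm

    # Softmax to obtain a confidence distribution over the candidate labels.
    exp = np.exp(calibrated - calibrated.max())
    return exp / exp.sum()

# Toy example: two labels, e.g. "positive" (1 token) and
# "very negative" (2 tokens). Raw scores favor label 0, but after
# removing the content-free prior, label 1 wins.
probs = ncc_scores(
    label_logprobs=[-1.0, -2.6], label_lengths=[1, 2],
    cf_logprobs=[-0.5, -3.0], cf_lengths=[1, 2],
)
print(probs.argmax())  # → 1
```

In this toy run the uncalibrated per-token scores are [-1.0, -1.3] (label 0 wins), but the content-free baseline shows the model intrinsically prefers label 0; subtracting it flips the prediction to label 1.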
PDF · December 1, 2025