
Annotation-Efficient Universal Honesty Alignment

October 20, 2025
Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng
cs.AI

Abstract

Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods rely either on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. While effective, training-based calibration requires costly, large-scale labeling to achieve universal honesty alignment. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (about 0.18% of full supervision) and better alignment on unseen MMLU tasks than a calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
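As a rough illustration of the two stages described above, the following is a minimal PyTorch sketch: a scalar confidence head is first regressed onto cheap self-consistency scores (elicitation), then fine-tuned on a small set of correctness labels (calibration). Everything here is an assumption for illustration, not the paper's implementation: the names (`ConfidenceHead`, `train_stage`, `self_consistency`), the use of pooled hidden states as features, and all shapes and hyperparameters are hypothetical stand-ins.

```python
from collections import Counter

import torch
import torch.nn as nn


def self_consistency(answers: list[str]) -> float:
    """Agreement of sampled answers with the most frequent one,
    used as a cheap, annotation-free confidence signal."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


class ConfidenceHead(nn.Module):
    """Maps a pooled LLM hidden state to a confidence in [0, 1]."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(h)).squeeze(-1)


def train_stage(head, feats, targets, epochs=100, lr=1e-2):
    """Regress predicted confidence onto a supervision signal.
    Stage 1 (elicitation): targets are self-consistency scores.
    Stage 2 (calibration): targets are 0/1 correctness labels."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.BCELoss()  # accepts both soft and hard targets
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(head(feats), targets).backward()
        opt.step()
    return head


# Toy tensors standing in for pooled hidden states; in the paper the
# elicitation pool is large (560k) while the annotated set is small (~1k).
torch.manual_seed(0)
dim = 16
feats_large = torch.randn(512, dim)             # cheap, unlabeled pool
consistency = torch.rand(512)                   # self-consistency in [0, 1]
feats_small = torch.randn(16, dim)              # small annotated subset
correctness = torch.randint(0, 2, (16,)).float()

head = ConfidenceHead(dim)
head = train_stage(head, feats_large, consistency)           # elicit
head = train_stage(head, feats_small, correctness, lr=1e-3)  # calibrate
print(self_consistency(["Paris", "Paris", "Lyon", "Paris"]))  # 0.75
```

The design point the sketch tries to capture is that the expensive correctness labels are only needed to correct the bias of an already-informative confidence signal, which is why a small calibration set can suffice.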