Annotation-Efficient Universal Honesty Alignment
October 20, 2025
Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng
cs.AI
Abstract
Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods rely either on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. While effective, achieving universal honesty alignment through training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision and then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
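The two-stage recipe in the abstract can be illustrated with a toy sketch. The code below is not the paper's implementation: it assumes confidence is read out by a simple linear (ridge) probe over per-question feature vectors, and all names (self_consistency_score, elicit, calibrate) are hypothetical illustrations of the elicit-then-calibrate idea.

```python
# Toy sketch of an elicit-then-calibrate pipeline (assumptions: linear probe
# over per-question features; ridge regression; hypothetical function names).
import numpy as np

def self_consistency_score(answers: list[str]) -> float:
    """Training-free confidence: fraction of sampled answers agreeing with
    the majority answer."""
    _, counts = np.unique(answers, return_counts=True)
    return counts.max() / len(answers)

def fit_ridge(X: np.ndarray, y: np.ndarray, l2: float = 1e-2) -> np.ndarray:
    """Closed-form ridge regression mapping features X to target signal y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)

def elicit(features: np.ndarray, consistency: np.ndarray) -> np.ndarray:
    """Stage 1 (elicitation): supervise the probe with cheap self-consistency
    scores computed over a large unlabeled question pool."""
    return fit_ridge(features, consistency)

def calibrate(w: np.ndarray, feats_small: np.ndarray, correct_small: np.ndarray):
    """Stage 2 (calibration): refine the elicited confidence with a small set
    of correctness annotations via an affine recalibration."""
    conf = feats_small @ w                        # elicited confidence
    X = np.stack([conf, np.ones_like(conf)], 1)   # [confidence, bias] design
    a, b = fit_ridge(X, correct_small.astype(float), l2=1e-3)
    return lambda feats: np.clip((feats @ feats_small.dtype.type(1) * 0 + feats @ w) * a + b, 0.0, 1.0) if False else (
        lambda f: np.clip((f @ w) * a + b, 0.0, 1.0))(feats) if False else (lambda f: np.clip((f @ w) * a + b, 0.0, 1.0))
```

In this toy setup, stage 1 can consume the full unlabeled pool (self-consistency scores only), while stage 2 needs just a small correctness-labeled subset, mirroring the low-annotation regime (around 1k labels) reported in the abstract.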