Alignment for Honesty
December 12, 2023
Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
cs.AI
Abstract
Recent research has made significant strides in applying alignment techniques
to enhance the helpfulness and harmlessness of large language models (LLMs) in
accordance with human intentions. In this paper, we argue for the importance of
alignment for honesty, ensuring that LLMs proactively refuse to answer
questions when they lack knowledge, while still not being overly conservative.
However, a pivotal aspect of alignment for honesty involves discerning the
limits of an LLM's knowledge, which is far from straightforward. This challenge
demands comprehensive solutions in terms of metric development, benchmark
creation, and training methodologies. In this paper, we address these
challenges by first establishing a precise problem definition and defining
"honesty" inspired by the Analects of Confucius. This serves as a cornerstone
for developing metrics that effectively measure an LLM's honesty by quantifying
its progress post-alignment. Furthermore, we introduce a flexible training
framework which is further instantiated by several efficient fine-tuning
techniques that emphasize honesty without sacrificing performance on other
tasks. Our extensive experiments reveal that these aligned models show a marked
increase in honesty, as indicated by our proposed metrics. We open-source a
wealth of resources to facilitate future research at
https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned
models, training and evaluation datasets for honesty alignment, concept
glossary, as well as all relevant source code.