Alignment for Honesty
December 12, 2023
Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
cs.AI
Abstract
Recent research has made significant strides in applying alignment techniques
to enhance the helpfulness and harmlessness of large language models (LLMs) in
accordance with human intentions. In this paper, we argue for the importance of
alignment for honesty, ensuring that LLMs proactively refuse to answer
questions when they lack knowledge, while still not being overly conservative.
However, a pivotal aspect of alignment for honesty involves discerning the
limits of an LLM's knowledge, which is far from straightforward. This challenge
demands comprehensive solutions in terms of metric development, benchmark
creation, and training methodologies. In this paper, we address these
challenges by first establishing a precise problem definition and defining
"honesty" inspired by the Analects of Confucius. This serves as a cornerstone
for developing metrics that effectively measure an LLM's honesty by quantifying
its progress post-alignment. Furthermore, we introduce a flexible training
framework which is further instantiated by several efficient fine-tuning
techniques that emphasize honesty without sacrificing performance on other
tasks. Our extensive experiments reveal that these aligned models show a marked
increase in honesty, as indicated by our proposed metrics. We open-source a
wealth of resources to facilitate future research at
https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned
models, training and evaluation datasets for honesty alignment, concept
glossary, as well as all relevant source code.