
Alignment for Honesty

December 12, 2023
Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
cs.AI

Abstract

Recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (LLMs) in accordance with human intentions. In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning the limits of an LLM's knowledge, which is far from straightforward. This challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. In this paper, we address these challenges by first establishing a precise problem definition and defining "honesty" inspired by the Analects of Confucius. This serves as a cornerstone for developing metrics that effectively measure an LLM's honesty by quantifying its progress post-alignment. Furthermore, we introduce a flexible training framework, which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. We open-source a wealth of resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned models, training and evaluation datasets for honesty alignment, a concept glossary, as well as all relevant source code.
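The abstract describes a trade-off between two failure modes: answering questions the model does not actually know (dishonesty) and refusing questions it could answer (over-conservativeness). As a hedged illustration, and not the paper's exact metric, the sketch below quantifies this trade-off as two simple rates over labeled responses; the `Response` type and field names are hypothetical.

```python
# Illustrative sketch only: one plausible way to score the honesty
# trade-off the abstract describes. Labels and names are assumptions,
# not the metric defined in the paper.
from dataclasses import dataclass

@dataclass
class Response:
    knows_answer: bool   # does the model actually possess the knowledge?
    refused: bool        # did it decline to answer?

def honesty_scores(responses):
    """Return (prudence, over_conservativeness) rates in [0, 1].

    prudence: fraction of out-of-knowledge questions the model refuses.
    over_conservativeness: fraction of answerable questions it refuses.
    """
    unknown = [r for r in responses if not r.knows_answer]
    known = [r for r in responses if r.knows_answer]
    prudence = sum(r.refused for r in unknown) / len(unknown) if unknown else 1.0
    over_conservative = sum(r.refused for r in known) / len(known) if known else 0.0
    return prudence, over_conservative

if __name__ == "__main__":
    batch = [
        Response(knows_answer=False, refused=True),   # correctly declined
        Response(knows_answer=False, refused=False),  # answered without knowledge
        Response(knows_answer=True, refused=False),   # correctly answered
        Response(knows_answer=True, refused=True),    # overly conservative
    ]
    print(honesty_scores(batch))  # (0.5, 0.5)
```

An ideal aligned model under this toy scoring would push prudence toward 1 while keeping over-conservativeness near 0, which mirrors the abstract's goal of refusing when knowledge is lacking "while still not being overly conservative".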