Alignment for Honesty

Abstract

Recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (LLMs) in accordance with human intentions. In this paper, we argue for the importance of alignment for honesty: ensuring that LLMs proactively refuse to answer questions when they lack the relevant knowledge, without becoming overly conservative. A pivotal aspect of alignment for honesty, however, is discerning the limits of an LLM's knowledge, which is far from straightforward. This challenge demands comprehensive solutions in metric development, benchmark creation, and training methodology. We address it by first establishing a precise problem definition and defining "honesty" in a way inspired by the Analects of Confucius. This definition serves as a cornerstone for developing metrics that measure an LLM's honesty by quantifying its progress after alignment. We further introduce a flexible training framework, instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments show that the aligned models exhibit a marked increase in honesty, as indicated by our proposed metrics. To facilitate future research, we open-source a wealth of resources at https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned models, training and evaluation datasets for honesty alignment, a concept glossary, and all relevant source code.
