Alignment for Honesty

Abstract

Recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (LLMs) in accordance with human intentions. In this paper, we argue for the importance of alignment for honesty: ensuring that LLMs proactively refuse to answer questions when they lack the relevant knowledge, without becoming overly conservative. A pivotal aspect of alignment for honesty, however, is discerning the limits of an LLM's knowledge, which is far from straightforward. This challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining "honesty" with inspiration from the Analects of Confucius. This definition serves as a cornerstone for developing metrics that measure an LLM's honesty by quantifying its progress after alignment. Furthermore, we introduce a flexible training framework, instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments show that the aligned models exhibit a marked increase in honesty, as indicated by our proposed metrics. To facilitate future research, we open-source a wealth of resources at https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned models, training and evaluation datasets for honesty alignment, a concept glossary, and all relevant source code.
