Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Abstract

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of more fine-grained analysis, testing, and continuous improvement of LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
