Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
August 10, 2023
Authors: Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li
cs.AI
Abstract
Ensuring alignment, which refers to making models behave in accordance with
human intentions [1,2], has become a critical task before deploying large
language models (LLMs) in real-world applications. For instance, OpenAI devoted
six months to iteratively aligning GPT-4 before its release [3]. However, a
major challenge faced by practitioners is the lack of clear guidance on
evaluating whether LLM outputs align with social norms, values, and
regulations. This obstacle hinders systematic iteration and deployment of LLMs.
To address this issue, this paper presents a comprehensive survey of key
dimensions that are crucial to consider when assessing LLM trustworthiness. The
survey covers seven major categories of LLM trustworthiness: reliability,
safety, fairness, resistance to misuse, explainability and reasoning, adherence
to social norms, and robustness. Each major category is further divided into
several sub-categories, resulting in a total of 29 sub-categories.
Additionally, a subset of 8 sub-categories is selected for further
investigation, where corresponding measurement studies are designed and
conducted on several widely used LLMs. The measurement results indicate that,
in general, more aligned models tend to perform better in terms of overall
trustworthiness. However, the effectiveness of alignment varies across the
different trustworthiness categories considered. This highlights the importance
of more fine-grained analysis, testing, and continuous improvement of LLM
alignment. By shedding light on these key dimensions of LLM
trustworthiness, this paper aims to provide valuable insights and guidance to
practitioners in the field. Understanding and addressing these concerns will be
crucial in achieving reliable and ethically sound deployment of LLMs in various
applications.
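
To make the taxonomy concrete, below is a minimal sketch, in Python, of how the seven major trustworthiness categories could be wired into a per-category measurement harness. Only the seven category names come from the survey; everything else here (the `generate` and `score` callables and the prompt sets) is an illustrative placeholder, not the paper's actual benchmark.

```python
# Minimal sketch of a per-category trustworthiness measurement harness.
# Only the seven category names are taken from the survey; the prompt
# sets and scoring function are hypothetical stand-ins.
from statistics import mean
from typing import Callable

CATEGORIES = [
    "reliability",
    "safety",
    "fairness",
    "resistance to misuse",
    "explainability and reasoning",
    "adherence to social norms",
    "robustness",
]

def evaluate_model(
    generate: Callable[[str], str],      # model under test: prompt -> response
    prompts: dict[str, list[str]],       # category -> test prompts (placeholder data)
    score: Callable[[str, str], float],  # (category, response) -> score in [0, 1]
) -> dict[str, float]:
    """Return the mean score for each trustworthiness category."""
    results: dict[str, float] = {}
    for category in CATEGORIES:
        scores = [score(category, generate(p)) for p in prompts.get(category, [])]
        results[category] = mean(scores) if scores else float("nan")
    return results
```

Reporting a vector of per-category scores rather than a single aggregate mirrors the abstract's observation that alignment improves overall trustworthiness, but unevenly across categories.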