

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

August 10, 2023
Authors: Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li
cs.AI

Abstract

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
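The seven-category taxonomy described in the abstract can be sketched as a simple data structure for organizing per-category evaluation results. The category names below follow the abstract verbatim; the 29 sub-categories are omitted because the abstract does not enumerate them, and the aggregation helper is a hypothetical illustration, not the paper's measurement methodology:

```python
# The seven major trustworthiness categories named in the survey's abstract.
TRUSTWORTHINESS_CATEGORIES = [
    "reliability",
    "safety",
    "fairness",
    "resistance to misuse",
    "explainability and reasoning",
    "adherence to social norms",
    "robustness",
]


def mean_trustworthiness(scores: dict) -> float:
    """Hypothetical aggregation: average one score per category.

    Raises ValueError if any of the seven categories is missing, to make
    incomplete evaluations explicit rather than silently averaging fewer
    categories.
    """
    missing = [c for c in TRUSTWORTHINESS_CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing scores for categories: {missing}")
    return sum(scores[c] for c in TRUSTWORTHINESS_CATEGORIES) / len(
        TRUSTWORTHINESS_CATEGORIES
    )
```

A fine-grained report would keep the per-category scores rather than a single average, which is consistent with the abstract's observation that alignment effectiveness varies across categories.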