Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
May 4, 2023
作者: Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
cs.AI
Abstract
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised
fine-tuning (SFT) with human annotations and reinforcement learning from human
feedback (RLHF) to align the output of large language models (LLMs) with human
intentions, ensuring they are helpful, ethical, and reliable. However, this
dependence can significantly constrain the true potential of AI-assistant
agents due to the high cost of obtaining human supervision and related
issues of quality, reliability, diversity, self-consistency, and undesirable
biases. To address these challenges, we propose a novel approach called
SELF-ALIGN, which combines principle-driven reasoning and the generative power
of LLMs for the self-alignment of AI agents with minimal human supervision. Our
approach encompasses four stages: first, we use an LLM to generate synthetic
prompts, and a topic-guided method to augment the prompt diversity; second, we
use a small set of human-written principles for AI models to follow, and guide
the LLM through in-context learning from demonstrations (of principle
application) to produce helpful, ethical, and reliable responses to user
queries; third, we fine-tune the original LLM with the high-quality
self-aligned responses so that the resulting model can generate desirable
responses for each query directly, with no further need for the principle set
or the demonstrations; and finally, we offer a refinement step to address the
issue of overly brief or indirect responses. Applying SELF-ALIGN to the
LLaMA-65b base language model, we develop an AI assistant named Dromedary. With
fewer than 300 lines of human annotations (including < 200 seed prompts, 16
generic principles, and 5 exemplars for in-context learning), Dromedary
significantly surpasses the performance of several state-of-the-art AI systems,
including Text-Davinci-003 and Alpaca, on benchmark datasets with various
settings.
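
To make the four-stage pipeline concrete, the sketch below walks through the abstract in Python. It is a minimal illustration under stated assumptions: the `complete` and `fine_tune` stubs stand in for an arbitrary LLM inference call and a supervised fine-tuning routine, and the prompt templates and target counts are hypothetical, not the authors' released implementation.

```python
# Illustrative sketch of the four SELF-ALIGN stages summarized in the
# abstract. All function names, prompt templates, and the complete /
# fine_tune stubs are assumptions made for exposition.
import random

def complete(model, prompt: str) -> str:
    """Hypothetical LLM completion call (stand-in for any inference API)."""
    raise NotImplementedError

def fine_tune(model, pairs):
    """Hypothetical supervised fine-tuning over (prompt, response) pairs."""
    raise NotImplementedError

# Stage 1: synthetic prompt generation, with topics used to diversify the
# prompts bootstrapped from the < 200 human-written seed prompts.
def synthesize_prompts(base_model, seeds, topics, target=1000):
    prompts = list(seeds)
    while len(prompts) < target:
        topic = random.choice(topics)
        examples = "\n".join(random.sample(prompts, k=min(5, len(prompts))))
        prompts.append(complete(
            base_model,
            f"Topic: {topic}\nExample instructions:\n{examples}\n"
            f"Write one new instruction on this topic:"))
    return prompts

# Stage 2: principle-driven responses via in-context learning, using the
# 16 human-written principles and the 5 exemplars of applying them.
def self_align(base_model, prompts, principles, exemplars):
    header = f"{principles}\n\n{exemplars}"
    return [(p, complete(base_model, f"{header}\n\nUser: {p}\nAssistant:"))
            for p in prompts]

# Stage 3: fine-tune the base model on the self-aligned pairs, so that the
# principles and exemplars are no longer needed at inference time.
def engrave_principles(base_model, aligned_pairs):
    return fine_tune(base_model, aligned_pairs)

# Stage 4: refinement pass targeting overly brief or indirect answers.
def refine(aligned_model, prompts):
    verbose_pairs = [(p, complete(
        aligned_model,
        f"User: {p}\nAssistant (answer thoroughly and directly):"))
        for p in prompts]
    return fine_tune(aligned_model, verbose_pairs)
```

In this reading, the fewer-than-300 lines of human input are consumed only in stages 1 and 2 (the seed prompts, the 16 principles, and the 5 exemplars); after the stage-3 fine-tuning the model answers queries directly, and stage 4 reuses only the model's own outputs.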