Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
May 4, 2023
作者: Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
cs.AI
Abstract
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised
fine-tuning (SFT) with human annotations and reinforcement learning from human
feedback (RLHF) to align the output of large language models (LLMs) with human
intentions, ensuring they are helpful, ethical, and reliable. However, this
dependence can significantly constrain the true potential of AI-assistant
agents due to the high cost of obtaining human supervision and related
issues of quality, reliability, diversity, self-consistency, and undesirable
biases. To address these challenges, we propose a novel approach called
SELF-ALIGN, which combines principle-driven reasoning and the generative power
of LLMs for the self-alignment of AI agents with minimal human supervision. Our
approach encompasses four stages: first, we use an LLM to generate synthetic
prompts, and a topic-guided method to augment the prompt diversity; second, we
use a small set of human-written principles for AI models to follow, and guide
the LLM through in-context learning from demonstrations (of principle
application) to produce helpful, ethical, and reliable responses to user
queries; third, we fine-tune the original LLM with the high-quality
self-aligned responses so that the resulting model can generate desirable
responses for each query directly, with no further need for the principle set
or the demonstrations; and finally, we offer a refinement step to address the
issue of overly brief or indirect responses. Applying SELF-ALIGN to the
LLaMA-65b base language model, we develop an AI assistant named Dromedary. With
fewer than 300 lines of human annotations (including < 200 seed prompts, 16
generic principles, and 5 exemplars for in-context learning), Dromedary
significantly surpasses the performance of several state-of-the-art AI systems,
including Text-Davinci-003 and Alpaca, on benchmark datasets with various
settings.
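
To make the four-stage pipeline concrete, the sketch below walks through the abstract in Python. It is a minimal illustration under stated assumptions: the `complete` and `fine_tune` stubs stand in for an arbitrary LLM inference call and a supervised fine-tuning routine, and the prompt templates and target counts are hypothetical, not the authors' released implementation.

```python
# Illustrative sketch of the four SELF-ALIGN stages summarized in the
# abstract. All function names, prompt templates, and the complete /
# fine_tune stubs are assumptions made for exposition.
import random

def complete(model, prompt: str) -> str:
    """Hypothetical LLM completion call (stand-in for any inference API)."""
    raise NotImplementedError

def fine_tune(model, pairs):
    """Hypothetical supervised fine-tuning over (prompt, response) pairs."""
    raise NotImplementedError

# Stage 1: synthetic prompt generation, with topics used to diversify the
# prompts bootstrapped from the < 200 human-written seed prompts.
def synthesize_prompts(base_model, seeds, topics, target=1000):
    prompts = list(seeds)
    while len(prompts) < target:
        topic = random.choice(topics)
        examples = "\n".join(random.sample(prompts, k=min(5, len(prompts))))
        prompts.append(complete(
            base_model,
            f"Topic: {topic}\nExample instructions:\n{examples}\n"
            f"Write one new instruction on this topic:"))
    return prompts

# Stage 2: principle-driven responses via in-context learning, using the
# 16 human-written principles and the 5 exemplars of applying them.
def self_align(base_model, prompts, principles, exemplars):
    header = f"{principles}\n\n{exemplars}"
    return [(p, complete(base_model, f"{header}\n\nUser: {p}\nAssistant:"))
            for p in prompts]

# Stage 3: fine-tune the base model on the self-aligned pairs, so that the
# principles and exemplars are no longer needed at inference time.
def engrave_principles(base_model, aligned_pairs):
    return fine_tune(base_model, aligned_pairs)

# Stage 4: refinement pass targeting overly brief or indirect answers.
def refine(aligned_model, prompts):
    verbose_pairs = [(p, complete(
        aligned_model,
        f"User: {p}\nAssistant (answer thoroughly and directly):"))
        for p in prompts]
    return fine_tune(aligned_model, verbose_pairs)
```

In this reading, the fewer-than-300 lines of human input are consumed only in stages 1 and 2 (the seed prompts, the 16 principles, and the 5 exemplars); after the stage-3 fine-tuning the model answers queries directly, and stage 4 reuses only the model's own outputs.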