
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

May 4, 2023
Authors: Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
cs.AI

Abstract

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of the application of these principles) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly, without needing the principle set or the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
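To make the four-stage pipeline concrete, below is a minimal Python sketch of SELF-ALIGN under stated assumptions: the `generate` callable stands in for the base LLM (e.g., LLaMA-65b), and all helper names, prompt templates, and the brevity heuristic in stage 4 are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of the four SELF-ALIGN stages. `generate` is a stand-in
# for the base LLM; every helper name and template here is hypothetical.
from typing import Callable, List, Dict

def self_align(generate: Callable[[str], str],
               seed_prompts: List[str],
               topics: List[str],
               principles: str,
               demonstrations: str) -> List[Dict[str, str]]:
    # Stage 1: topic-guided synthetic prompt generation to diversify prompts.
    synthetic_prompts = []
    for topic in topics:
        meta_prompt = ("Here are example user questions:\n"
                       + "\n".join(seed_prompts[:5])
                       + f"\nWrite a new user question about {topic}:")
        synthetic_prompts.append(generate(meta_prompt))

    # Stage 2: principle-driven in-context responses. The principles and
    # demonstrations are prepended so the base model answers in line with them.
    training_pairs = []
    for prompt in synthetic_prompts:
        context = (f"{principles}\n\n{demonstrations}\n\n"
                   f"User: {prompt}\nAssistant (following the principles):")
        training_pairs.append({"prompt": prompt, "response": generate(context)})

    # Stage 3: fine-tune the original LLM on the self-aligned pairs so it
    # answers each query directly, with no principles or demos in context.
    # fine_tune(base_model, training_pairs)  # training loop omitted here

    # Stage 4: refinement -- regenerate overly brief or indirect responses
    # with an instruction asking for a more detailed, direct answer.
    for pair in training_pairs:
        if len(pair["response"].split()) < 30:  # crude brevity heuristic
            pair["response"] = generate(
                f"User: {pair['prompt']}\n"
                "Assistant (verbose, detailed, and direct):")
    return training_pairs
```

In this reading, the human supervision budget is only what the sketch takes as input: the seed prompts, the principle text, and a handful of demonstrations; everything used for fine-tuning is generated by the model itself.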