Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
June 19, 2024
Authors: Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
cs.AI
Abstract
One core capability of large language models (LLMs) is to follow natural
language instructions. However, the issue of automatically constructing
high-quality training data to enhance the complex instruction-following
abilities of LLMs without manual annotation remains unresolved. In this paper,
we introduce AutoIF, the first scalable and reliable method for automatically
generating instruction-following training data. AutoIF transforms the
validation of instruction-following data quality into code verification,
requiring LLMs to generate instructions, the corresponding code to check the
correctness of the instruction responses, and unit test samples to verify the
code's correctness. Then, execution feedback-based rejection sampling can
generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from
Human Feedback (RLHF) training. AutoIF achieves significant improvements across
three training algorithms, SFT, Offline DPO, and Online DPO, when applied to
the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and
strong-to-weak distillation settings. Our code is publicly available at
https://github.com/QwenLM/AutoIF.
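To make the verification-as-code idea concrete, the following is a minimal, hypothetical Python sketch, not the authors' implementation: the example instruction, the checker logic, and all function names (build_checker, verifier_is_valid, rejection_sample) are assumptions for illustration. It mirrors the three ingredients described above: an LLM-generated checker function, unit-test cases that validate the checker, and execution-feedback rejection sampling that keeps only responses the checker accepts.

```python
# Illustrative sketch of verification-as-code with execution-feedback
# rejection sampling. All names and the example instruction are hypothetical.

# Example instruction: "Answer in exactly three sentences."
# Code an LLM might generate to check responses against that instruction.
VERIFIER_CODE = """
def check(response: str) -> bool:
    import re
    # Count sentence-ending punctuation as a rough proxy for sentence count.
    sentences = [s for s in re.split(r'[.!?]+', response) if s.strip()]
    return len(sentences) == 3
"""

# Unit-test cases the LLM also generates; they gate whether the checker is kept.
UNIT_TESTS = [
    ("One. Two. Three.", True),
    ("Only one sentence.", False),
]


def build_checker(code: str):
    """Execute the generated checker code in an isolated namespace."""
    namespace = {}
    exec(code, namespace)  # a real pipeline would sandbox this execution
    return namespace["check"]


def verifier_is_valid(check, tests) -> bool:
    """Keep the checker only if it agrees with every unit-test case."""
    try:
        return all(check(inp) == expected for inp, expected in tests)
    except Exception:
        return False


def rejection_sample(check, candidate_responses):
    """Split candidate responses by execution feedback from the checker."""
    accepted, rejected = [], []
    for resp in candidate_responses:
        try:
            (accepted if check(resp) else rejected).append(resp)
        except Exception:
            rejected.append(resp)
    return accepted, rejected


if __name__ == "__main__":
    check = build_checker(VERIFIER_CODE)
    assert verifier_is_valid(check, UNIT_TESTS)
    good, bad = rejection_sample(check, [
        "Cats purr. Dogs bark. Birds sing.",
        "This response ignores the constraint",
    ])
    print(len(good), "accepted;", len(bad), "rejected")
```

In this sketch, accepted responses would correspond to SFT data, and the accepted/rejected split could supply candidate preference pairs for the Offline or Online DPO settings mentioned in the abstract; the exact data construction in AutoIF may differ.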