Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
June 19, 2024
Authors: Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
cs.AI
Abstract
One core capability of large language models (LLMs) is to follow natural
language instructions. However, the issue of automatically constructing
high-quality training data to enhance the complex instruction-following
abilities of LLMs without manual annotation remains unresolved. In this paper,
we introduce AutoIF, the first scalable and reliable method for automatically
generating instruction-following training data. AutoIF transforms the
validation of instruction-following data quality into code verification,
requiring LLMs to generate instructions, the corresponding code to check the
correctness of the instruction responses, and unit test samples to verify the
code's correctness. Then, execution feedback-based rejection sampling can
generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from
Human Feedback (RLHF) training. AutoIF achieves significant improvements across
three training algorithms, SFT, Offline DPO, and Online DPO, when applied to
the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and
strong-to-weak distillation settings. Our code is publicly available at
https://github.com/QwenLM/AutoIF.
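To make the verification-as-code idea concrete, the following is a minimal, hypothetical Python sketch, not the authors' implementation: the example instruction, the checker logic, and all function names (build_checker, verifier_is_valid, rejection_sample) are assumptions for illustration. It mirrors the three ingredients described above: an LLM-generated checker function, unit-test cases that validate the checker, and execution-feedback rejection sampling that keeps only responses the checker accepts.

```python
# Illustrative sketch of verification-as-code with execution-feedback
# rejection sampling. All names and the example instruction are hypothetical.

# Example instruction: "Answer in exactly three sentences."
# Code an LLM might generate to check responses against that instruction.
VERIFIER_CODE = """
def check(response: str) -> bool:
    import re
    # Count sentence-ending punctuation as a rough proxy for sentence count.
    sentences = [s for s in re.split(r'[.!?]+', response) if s.strip()]
    return len(sentences) == 3
"""

# Unit-test cases the LLM also generates; they gate whether the checker is kept.
UNIT_TESTS = [
    ("One. Two. Three.", True),
    ("Only one sentence.", False),
]


def build_checker(code: str):
    """Execute the generated checker code in an isolated namespace."""
    namespace = {}
    exec(code, namespace)  # a real pipeline would sandbox this execution
    return namespace["check"]


def verifier_is_valid(check, tests) -> bool:
    """Keep the checker only if it agrees with every unit-test case."""
    try:
        return all(check(inp) == expected for inp, expected in tests)
    except Exception:
        return False


def rejection_sample(check, candidate_responses):
    """Split candidate responses by execution feedback from the checker."""
    accepted, rejected = [], []
    for resp in candidate_responses:
        try:
            (accepted if check(resp) else rejected).append(resp)
        except Exception:
            rejected.append(resp)
    return accepted, rejected


if __name__ == "__main__":
    check = build_checker(VERIFIER_CODE)
    assert verifier_is_valid(check, UNIT_TESTS)
    good, bad = rejection_sample(check, [
        "Cats purr. Dogs bark. Birds sing.",
        "This response ignores the constraint",
    ])
    print(len(good), "accepted;", len(bad), "rejected")
```

In this sketch, accepted responses would correspond to SFT data, and the accepted/rejected split could supply candidate preference pairs for the Offline or Online DPO settings mentioned in the abstract; the exact data construction in AutoIF may differ.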