

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

June 19, 2024
Authors: Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
cs.AI

Abstract

One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
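As a rough illustration of the execution-feedback step described above, the sketch below shows how an LLM-written verifier function might be checked against unit-test samples and then used for rejection sampling of responses. This is a minimal sketch under assumptions: the function names (`check`, `is_safe`, `passes_unit_tests`, `rejection_sample`) and the toy "answer in exactly three words" instruction are illustrative, not the actual AutoIF implementation from the linked repository.

```python
import ast

def is_safe(code: str) -> bool:
    """Rough static check: the snippet must parse and define a callable named
    `check`. (Placeholder for a real sandbox/safety filter; names are illustrative.)"""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    return any(isinstance(n, ast.FunctionDef) and n.name == "check" for n in ast.walk(tree))

def passes_unit_tests(code: str, test_cases) -> bool:
    """Execute the LLM-written verifier and confirm it labels the
    unit-test (response, expected) samples correctly."""
    namespace = {}
    try:
        exec(code, namespace)  # in practice this would run in a sandbox
        check = namespace["check"]
        return all(check(resp) == expected for resp, expected in test_cases)
    except Exception:
        return False

def rejection_sample(instruction, verifier_code, test_cases, responses):
    """Keep only (instruction, response) pairs that the validated verifier accepts."""
    if not (is_safe(verifier_code) and passes_unit_tests(verifier_code, test_cases)):
        return []  # discard instructions whose verifier fails its own tests
    namespace = {}
    exec(verifier_code, namespace)
    check = namespace["check"]
    kept = []
    for resp in responses:
        try:
            if check(resp):
                kept.append((instruction, resp))  # candidate SFT / preference data
        except Exception:
            pass  # runtime failure counts as rejection
    return kept

# Toy example: the constraint "answer in exactly three words"
verifier = "def check(response):\n    return len(response.split()) == 3"
tests = [("one two three", True), ("too short", False)]
print(rejection_sample("Answer in exactly three words.", verifier, tests,
                       ["Paris is lovely", "I do not know the answer"]))
```

In this sketch, only verifiers that pass their own unit tests are trusted, and responses they accept become training pairs, mirroring the abstract's description of turning data-quality validation into code verification followed by execution-feedback-based rejection sampling.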
