Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Abstract

A core capability of large language models (LLMs) is following natural language instructions. However, automatically constructing high-quality training data that improves the complex instruction-following abilities of LLMs without manual annotation remains an open problem. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF recasts the validation of instruction-following data quality as code verification: LLMs are required to generate instructions, the corresponding code that checks the correctness of responses to those instructions, and unit test samples that verify the code's correctness. Execution feedback-based rejection sampling then produces data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. Applied to the top open-source LLMs, Qwen2 and LLaMA3, AutoIF achieves significant improvements across three training algorithms (SFT, Offline DPO, and Online DPO) in both self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
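The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the AutoIF implementation: in AutoIF the instruction, the verification code, and the unit test samples are generated by an LLM, whereas here they are hard-coded for illustration. The instruction text, the `evaluate` function name, the test cases, and every helper name below are assumptions introduced for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical instruction and verifier source. In the described method an LLM
# would write both; they are fixed strings here so the sketch is runnable.
INSTRUCTION = "Answer in at most five words."
VERIFIER_SOURCE = '''
def evaluate(response: str) -> bool:
    return 0 < len(response.split()) <= 5
'''

# Hypothetical unit test samples: (response, expected verdict).
TEST_CASES: List[Tuple[str, bool]] = [
    ("Paris is the capital.", True),
    ("The capital city of France is Paris.", False),
    ("", False),
]


def load_verifier(source: str) -> Callable[[str], bool]:
    """Execute the verifier source and return its `evaluate` function."""
    namespace: dict = {}
    exec(source, namespace)  # generated code would need sandboxing in practice
    return namespace["evaluate"]


def verifier_passes_tests(verifier: Callable[[str], bool],
                          cases: List[Tuple[str, bool]]) -> bool:
    """Keep a verifier only if it agrees with every unit test sample."""
    for response, expected in cases:
        try:
            if verifier(response) != expected:
                return False
        except Exception:
            return False  # a crashing verifier is treated as a failure
    return True


def rejection_sample(verifier: Callable[[str], bool],
                     candidates: List[str]) -> Tuple[List[str], List[str]]:
    """Split candidate responses into accepted and rejected by execution feedback."""
    accepted: List[str] = []
    rejected: List[str] = []
    for candidate in candidates:
        (accepted if verifier(candidate) else rejected).append(candidate)
    return accepted, rejected


def build_training_data(instruction: str, accepted: List[str],
                        rejected: List[str]) -> Tuple[List[Dict], List[Dict]]:
    """Accepted responses become SFT samples; accepted/rejected pairs become preference pairs."""
    sft = [{"prompt": instruction, "response": r} for r in accepted]
    pairs = [{"prompt": instruction, "chosen": a, "rejected": b}
             for a in accepted for b in rejected]
    return sft, pairs


if __name__ == "__main__":
    verifier = load_verifier(VERIFIER_SOURCE)
    if verifier_passes_tests(verifier, TEST_CASES):
        candidates = ["Paris.", "It is Paris, the well-known capital of France."]
        accepted, rejected = rejection_sample(verifier, candidates)
        sft, pairs = build_training_data(INSTRUCTION, accepted, rejected)
        print(len(sft), "SFT samples;", len(pairs), "preference pairs")
```

The sketch validates the verifier against the unit test samples before any response is filtered, so that a faulty checker is discarded rather than allowed to label data. Pairing every accepted response with every rejected one is only one possible way to form preference data for DPO-style training; the abstract does not specify how pairs are built.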
