REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
May 10, 2025
作者: Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal
cs.AI
Abstract
Instruction-based Large Language Models (LLMs) have proven effective in
numerous few-shot and zero-shot Natural Language Processing (NLP) tasks.
However, creating human-annotated instruction data is time-consuming,
expensive, and often limited in quantity and task diversity. Previous research
has attempted to address this challenge by proposing frameworks that generate
instructions in a semi-automated, task-agnostic manner directly from the model
itself. Many of these efforts have relied on large, API-only models such as
GPT-3.5 (175B), which are expensive and subject to query limits. This paper
explores the performance of three small open-source LLMs, LLaMA 2-7B,
LLaMA 2-13B, and Mistral 7B, within such a semi-automated framework, thereby
reducing the human intervention, effort, and cost required to generate an
instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that
incorporating a Reinforcement Learning (RL) based training algorithm into this
framework leads to further enhancements. Our evaluation of the dataset reveals
that these RL-based frameworks achieve substantial improvements in 63-66% of
the tasks compared to previous approaches.