REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
May 10, 2025
作者: Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal
cs.AI
Abstract
Instruction-based Large Language Models (LLMs) have proven effective in
numerous few-shot and zero-shot Natural Language Processing (NLP) tasks.
However, creating human-annotated instruction data is time-consuming,
expensive, and often limited in quantity and task diversity. Previous research
has attempted to address this challenge by proposing frameworks that generate
instructions in a semi-automated, task-agnostic manner directly from the model
itself. Many of these efforts have relied on large, API-only models such as
GPT-3.5 (175B), which are expensive and subject to query limits. This paper
explores the performance of three small open-source LLMs, LLaMA 2-7B,
LLaMA 2-13B, and Mistral 7B, within such a semi-automated framework, thereby
reducing the human intervention, effort, and cost required to generate an
instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that
incorporating a Reinforcement Learning (RL) based training algorithm into this
framework leads to further enhancements. Our evaluation of the dataset reveals
that these RL-based frameworks achieve substantial improvements in 63-66% of
the tasks compared to previous approaches.