Prior Prompt Engineering for Reinforcement Fine-Tuning
May 20, 2025
Authors: Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul
cs.AI
Abstract
This paper investigates prior prompt engineering (pPE) in the context of
reinforcement fine-tuning (RFT), where language models (LMs) are incentivized
to exhibit behaviors that maximize performance through reward signals. While
existing RFT research has primarily focused on algorithms, reward shaping, and
data curation, the design of the prior prompt--the instructions prepended to
queries during training to elicit behaviors such as step-by-step
reasoning--remains underexplored. We investigate whether different pPE
approaches can guide LMs to internalize distinct behaviors after RFT. Inspired
by inference-time prompt engineering (iPE), we translate five representative
iPE strategies--reasoning, planning, code-based reasoning, knowledge recall,
and null-example utilization--into corresponding pPE approaches. We experiment
with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on
in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and
GPQA-Diamond). Our results show that all pPE-trained models surpass their
iPE-prompted counterparts, with the null-example pPE approach achieving the
largest average performance gain and the highest improvement on AIME2024 and
GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by
adapting a behavior-classification framework, we demonstrate that different pPE
strategies instill distinct behavioral styles in the resulting models. These
findings position pPE as a powerful yet understudied axis for RFT.
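
To make the pPE/iPE distinction concrete, the sketch below shows where the prior prompt enters the pipeline: in pPE the instruction is prepended to every training query during RFT so the behavior becomes internalized, whereas in iPE the same instruction is only added at inference time. This is a minimal illustration; the prompt strings, the build_chat helper, and the message format are assumptions for exposition, not the paper's actual templates or training code.

```python
# Illustrative sketch of pPE vs. iPE. The prompt texts and helper below are
# assumptions, not the authors' exact templates or RFT training loop.

PRIOR_PROMPTS = {
    # One hypothetical instruction per iPE-derived strategy from the abstract.
    "reasoning": "Think step by step before giving the final answer.",
    "planning": "Write a short plan of sub-goals before solving the problem.",
    "code": "Reason by writing and tracing short code snippets.",
    "knowledge": "First recall relevant facts and formulas, then solve.",
    "null_example": "Consider a null or trivial instance of the problem before answering.",
}


def build_chat(query: str, prior_prompt: str | None) -> list[dict]:
    """Prepend the prior prompt (if any) to the user query."""
    messages = []
    if prior_prompt:
        messages.append({"role": "system", "content": prior_prompt})
    messages.append({"role": "user", "content": query})
    return messages


# pPE: the instruction is attached to every *training* rollout during RFT;
# after training, the model is expected to exhibit the behavior even when
# queried without it.
train_example = build_chat(
    "Find the remainder of 7^2024 mod 13.", PRIOR_PROMPTS["null_example"]
)

# iPE: the same kind of instruction is only added at *inference* time to a
# model that was not trained with it.
inference_example = build_chat(
    "Find the remainder of 7^2024 mod 13.", PRIOR_PROMPTS["reasoning"]
)

print(train_example)
print(inference_example)
```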