Prior Prompt Engineering for Reinforcement Fine-Tuning
May 20, 2025
Authors: Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul
cs.AI
Abstract
This paper investigates prior prompt engineering (pPE) in the context of
reinforcement fine-tuning (RFT), where language models (LMs) are incentivized
to exhibit behaviors that maximize performance through reward signals. While
existing RFT research has primarily focused on algorithms, reward shaping, and
data curation, the design of the prior prompt--the instructions prepended to
queries during training to elicit behaviors such as step-by-step
reasoning--remains underexplored. We investigate whether different pPE
approaches can guide LMs to internalize distinct behaviors after RFT. Inspired
by inference-time prompt engineering (iPE), we translate five representative
iPE strategies--reasoning, planning, code-based reasoning, knowledge recall,
and null-example utilization--into corresponding pPE approaches. We experiment
with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on
in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and
GPQA-Diamond). Our results show that all pPE-trained models surpass their
iPE-prompted counterparts, with the null-example pPE approach achieving the
largest average performance gain and the highest improvement on AIME2024 and
GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by
adapting a behavior-classification framework, we demonstrate that different pPE
strategies instill distinct behavioral styles in the resulting models. These
findings position pPE as a powerful yet understudied axis for RFT.
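To make the core idea concrete, the sketch below shows how a prior prompt might be prepended to training queries during RFT. This is a minimal illustration, not the authors' code: the prompt wordings, the PRIOR_PROMPTS dictionary, and the build_rft_example helper are all hypothetical stand-ins for the five pPE strategies named in the abstract.

```python
# Minimal sketch (illustrative, not the authors' implementation) of prior
# prompt engineering: a fixed instruction is prepended to every training
# query before the model generates rollouts that are scored by a reward.

# Hypothetical wordings for the five pPE strategies described in the paper.
PRIOR_PROMPTS = {
    "reasoning": "Think through the problem step by step before giving the final answer.",
    "planning": "First write a brief plan, then follow it to solve the problem.",
    "code_reasoning": "Reason by writing and tracing code before giving the final answer.",
    "knowledge_recall": "Recall relevant facts and formulas before giving the final answer.",
    "null_example": "Consider what an empty or trivial instance of this problem would look like before answering.",
}

def build_rft_example(query: str, strategy: str) -> str:
    """Prepend the chosen prior prompt to a training query (assumed format)."""
    return f"{PRIOR_PROMPTS[strategy]}\n\nQuestion: {query}"

# Example: the same query framed with the null-example pPE strategy.
print(build_rft_example("What is the sum of the first 100 positive integers?", "null_example"))
```

Under this framing, swapping the strategy key changes only the prepended instruction; the reward signal and RFT algorithm stay fixed, which is what allows the paper to attribute behavioral differences in the trained models to the prior prompt alone.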