Prior Prompt Engineering for Reinforcement Fine-Tuning

Abstract

This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
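As a rough illustration of what "prepending a prior prompt to a query" could look like in practice, the sketch below builds training prompts for the five approaches named in the abstract. The template wording, the dictionary `PRIOR_PROMPTS`, and the helper `build_training_prompt` are all hypothetical placeholders introduced here for illustration; they are not the prompts or code used in the paper.

```python
# Hypothetical prior-prompt templates, one per pPE approach named in the abstract.
# The wording is a placeholder assumption, NOT the paper's actual prompts.
PRIOR_PROMPTS = {
    "reasoning": "Think step by step before giving the final answer.",
    "planning": "First outline a plan, then follow it to reach the answer.",
    "code_based_reasoning": "Work through the problem by writing code, then state the answer.",
    "knowledge_recall": "Recall relevant definitions and facts before answering.",
    "null_example": "Consider illustrative examples before answering.",
}


def build_training_prompt(approach: str, query: str) -> str:
    """Prepend the prior prompt for `approach` to a training query."""
    if approach not in PRIOR_PROMPTS:
        raise KeyError(f"unknown pPE approach: {approach!r}")
    # The prior prompt comes first, followed by a blank line and the query.
    return f"{PRIOR_PROMPTS[approach]}\n\n{query}"


if __name__ == "__main__":
    # Build one prompt per approach for the same (made-up) query.
    for name in PRIOR_PROMPTS:
        print(f"--- {name} ---")
        print(build_training_prompt(name, "What is 17 * 24?"))
```

Under this assumption, the same query set would be paired with a different prior prompt in each RFT run, so that any behavioral differences after training can be attributed to the prompt rather than to the data.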

