Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Abstract

Reinforcement learning enhances the reasoning capabilities of large language models, but it often incurs high computational cost because optimization is rollout-intensive. Online prompt selection is a promising remedy: it improves training efficiency by prioritizing informative prompts. However, current methods either depend on costly exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient allocation of compute. Experiments across varied reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baseline methods.
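The intermediate-difficulty idea can be illustrated with a deliberately simplified sketch. It is not the GPS method: in place of the lightweight generative model it assumes an independent Beta-Bernoulli posterior per prompt, and it omits the history-anchored diversity term. The names `predicted_success` and `select_batch`, the uniform prior, and the target success rate of 0.5 are all assumptions made for this illustration.

```python
def predicted_success(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean success rate under a Beta(prior_a, prior_b) prior.

    Assumption: each prompt is modeled independently, unlike the shared
    generative model described in the abstract.
    """
    return (successes + prior_a) / (successes + failures + prior_a + prior_b)


def select_batch(history, batch_size, target=0.5):
    """Pick the prompts whose predicted success rate is closest to `target`.

    `history` maps a prompt id to (successes, failures) counts from past
    rollouts. Ties are broken by prompt id so the result is deterministic.
    """
    scored = []
    for prompt_id, (successes, failures) in history.items():
        p = predicted_success(successes, failures)
        scored.append((abs(p - target), prompt_id))
    scored.sort()  # smallest distance to the target first
    return [prompt_id for _, prompt_id in scored[:batch_size]]


if __name__ == "__main__":
    # Hypothetical rollout counts for four prompts.
    history = {"a": (8, 0), "b": (4, 4), "c": (0, 8), "d": (5, 3)}
    print(select_batch(history, batch_size=2))
```

In this toy setting, prompts that are almost always solved or almost never solved are skipped, which reflects the intuition that they tend to carry little learning signal for rollout-based training.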
