Efficient Process Reward Model Training via Active Learning

Abstract

Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose ActPRM, an active learning approach that proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass and retain only highly uncertain data. A capable yet costly reasoning model then labels this data, after which we compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting and show that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering more than 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) among models of the same size.
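
The training loop described above (forward pass, uncertainty estimate, selective labeling, weight update) can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' implementation: the linear model stands in for the PRM, `costly_labeler` is a hypothetical stand-in for the reasoning model, and binary entropy with a fixed `threshold` is an assumed uncertainty measure that the abstract does not specify.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_entropy(p):
    # Entropy (in nats) of a Bernoulli(p) prediction; it peaks at p = 0.5.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def costly_labeler(x):
    # Hypothetical stand-in for the expensive reasoning model.
    return 1 if x[0] + x[1] > 0 else 0


def make_pool(n, seed=0):
    # Toy 2-D feature vectors standing in for reasoning steps.
    rng = random.Random(seed)
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]


def active_train(pool, threshold=0.5, lr=0.5, labeler=costly_labeler):
    w, b = [0.0, 0.0], 0.0  # toy linear model standing in for the PRM
    num_labeled = 0
    for x in pool:
        # 1) Forward pass.
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # 2) Keep the sample only if the prediction is uncertain enough
        #    (assumed criterion: binary entropy at or above the threshold).
        if binary_entropy(p) < threshold:
            continue
        # 3) Query the costly labeler for retained samples only.
        y = labeler(x)
        num_labeled += 1
        # 4) One SGD step on the binary cross-entropy loss.
        grad = p - y
        w[0] -= lr * grad * x[0]
        w[1] -= lr * grad * x[1]
        b -= lr * grad
    return w, b, num_labeled


def accuracy(w, b, pool, labeler=costly_labeler):
    # Fraction of pool items whose thresholded prediction matches the labeler.
    hits = 0
    for x in pool:
        pred = sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5
        hits += int(pred == (labeler(x) == 1))
    return hits / len(pool)


if __name__ == "__main__":
    pool = make_pool(500)
    w, b, n = active_train(pool)
    print(f"labeled {n} of {len(pool)} samples, accuracy {accuracy(w, b, pool):.2f}")
```

Setting `threshold=0.0` labels every sample, which corresponds to the vanilla baseline in this toy setup; raising the threshold trades labeling cost against how much supervision the model receives.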

