Efficient RLVR Training via Weighted Mutual Information Data Selection

Abstract

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency depends critically on how training data are selected. Existing online selection strategies rely predominantly on difficulty-based heuristics that favor datapoints with intermediate success rates. This implicitly equates difficulty with informativeness and neglects the epistemic uncertainty that arises from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty-dependent and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief about each datapoint's success rather than on noisy sampled outcomes, and it extends naturally to the multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including an average gain of +1.41 on Planning & Mathematics benchmarks, a +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
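To make the difficulty versus evidence distinction concrete, the following is a minimal sketch and not InSight's actual acquisition score. It assumes a uniform Beta(1, 1) prior over each datapoint's latent success rate and binary rollout outcomes, and it computes the standard, unweighted mutual information between the next outcome and the latent rate. The weighting used in the paper's objective is not specified in this abstract, so it is omitted. All function names are hypothetical.

```python
import math


def harmonic(n: int) -> float:
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def posterior_mean(successes: int, failures: int) -> float:
    """Mean belief of success under an assumed Beta(1, 1) prior."""
    a, b = 1 + successes, 1 + failures
    return a / (a + b)


def expected_outcome_entropy(successes: int, failures: int) -> float:
    """E[h(theta)] for theta ~ Beta(a, b) with integer a, b.

    Uses E[theta * ln(theta)] = a/(a+b) * (H_a - H_{a+b}), which follows
    from the digamma identity psi(n + 1) = H_n - gamma.
    """
    a, b = 1 + successes, 1 + failures
    n = a + b
    h_n = harmonic(n)
    return -(a / n) * (harmonic(a) - h_n) - (b / n) * (harmonic(b) - h_n)


def mutual_information(successes: int, failures: int) -> float:
    """I(outcome; theta) = h(E[theta]) - E[h(theta)], in nats."""
    mean = posterior_mean(successes, failures)
    return binary_entropy(mean) - expected_outcome_entropy(successes, failures)


if __name__ == "__main__":
    # Both datapoints have mean belief 0.5 (same apparent difficulty),
    # but they differ in how much evidence backs that belief.
    for s, f in [(0, 0), (10, 10)]:
        print(s, f, posterior_mean(s, f), mutual_information(s, f))
```

In this sketch, a datapoint with no observations and one with 10 successes and 10 failures share the same mean belief of 0.5, yet the second yields a much smaller expected uncertainty reduction. A selector that looks only at the success rate cannot tell them apart, which is the limitation of difficulty-only selection that the abstract describes.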
