Can LLMs Learn to Reason Robustly under Noisy Supervision?

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models effectively but relies on abundant, accurate labels, and its vulnerability to the noisy labels that expert scarcity makes unavoidable remains underexplored. In this work, we take a first step toward a systematic analysis of noisy-label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label influences training only if the current policy can generate rollouts that realize it, and this property extends naturally to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: accuracy on clean and noisy samples increases similarly in early training, and noisy samples begin to lag behind only in later stages. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively replaces potentially noisy labels with majority-voted answers when two conditions hold: the rollout pass rate of the majority answer has a positive slope, and the majority answer remains historically consistent across updates. This enables gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
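The two refinement conditions named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the slope or the historical consistency is measured, so everything concrete here is an assumption made for illustration: the function names (`majority_answer`, `least_squares_slope`, `refine_label`), the `window` parameter, the least-squares slope over the last few updates, and the rule that "consistent" means the same majority answer at every update in the window. It is not the authors' implementation.

```python
from collections import Counter


def majority_answer(rollout_answers):
    """Return the most frequent final answer among one update's rollouts."""
    return Counter(rollout_answers).most_common(1)[0][0]


def least_squares_slope(values):
    """Slope of a least-squares line fitted to values at steps 0..n-1 (n >= 2)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def refine_label(current_label, history, window=3):
    """Decide whether to replace a possibly noisy label for one prompt.

    history: list of rollout-answer lists, one per policy update, oldest first.
    window:  hypothetical number of recent updates to inspect (must be >= 2).
    """
    if len(history) < window:
        return current_label  # not enough evidence yet

    recent = history[-window:]
    majorities = [majority_answer(answers) for answers in recent]
    candidate = majorities[-1]

    # Assumed consistency test: the majority answer is identical at every
    # update in the window.
    if any(m != candidate for m in majorities):
        return current_label

    # Assumed slope test: the fraction of rollouts that produce the candidate
    # answer is trending upward across the window.
    pass_rates = [answers.count(candidate) / len(answers) for answers in recent]
    if least_squares_slope(pass_rates) <= 0:
        return current_label

    return candidate
```

In this sketch a label is overwritten only when both checks pass, so a majority answer that is stable but not gaining support, or gaining support but recently changed, leaves the original label in place.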
