TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Abstract

Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. RL's success, however, depends on the reliability of those rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, in which verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses, which yields more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improving RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
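
As a rough illustration of the idea of augmenting a rule-based verifier with a second check on rejected answers, the following Python sketch may help. It is not the TinyV implementation: the function names (`rule_based_verify`, `combined_reward`, `toy_llm_verify`) are hypothetical, the rule-based check is assumed to be a normalized string match, and the LLM-based verifier is replaced by a toy numeric-equivalence check so that the example runs on its own.

```python
from fractions import Fraction
from typing import Callable


def rule_based_verify(response: str, reference: str) -> bool:
    """Hypothetical rule-based check: exact match after light normalization."""
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "")
    return normalize(response) == normalize(reference)


def combined_reward(
    response: str,
    reference: str,
    fallback_verify: Callable[[str, str], bool],
) -> float:
    """Return 1.0 if either verifier accepts the response, else 0.0.

    The fallback is consulted only when the rule-based check rejects,
    so it can recover potential false negatives without re-checking
    answers that were already accepted.
    """
    if rule_based_verify(response, reference):
        return 1.0
    return 1.0 if fallback_verify(response, reference) else 0.0


def toy_llm_verify(response: str, reference: str) -> bool:
    """Toy stand-in for an LLM-based verifier (assumption, not TinyV).

    Treats two answers as equivalent when both parse to the same
    rational number, e.g. "1/2" and "0.5".
    """
    try:
        return Fraction(response.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return False


if __name__ == "__main__":
    # The rule-based check alone rejects "1/2" against "0.5" (a false negative);
    # the fallback recovers it.
    print(rule_based_verify("1/2", "0.5"))
    print(combined_reward("1/2", "0.5", toy_llm_verify))
```

In an actual RL pipeline, the fallback would be a call to a trained verifier model rather than a numeric comparison; the sketch only shows where such a call would sit relative to the rule-based check.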
