When Can LLMs Learn to Reason with Weak Supervision?

Abstract

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
