

When Can LLMs Learn to Reason with Weak Supervision?

April 20, 2026
作者: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov
cs.AI

Abstract

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
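To make the "self-supervised proxy rewards" setting concrete, below is a minimal sketch (not the authors' code) of one common way such a reward can be instantiated: scoring each sampled answer by agreement with the majority vote over the model's own samples, so no ground-truth labels are needed. The saturation check illustrates the abstract's observation that a prolonged pre-saturation phase in training reward tends to accompany generalization; the function names, window size, and threshold are illustrative assumptions, not values from the paper.

```python
# Sketch of a label-free proxy reward (majority-vote agreement) and a
# simple training-reward saturation check. Hypothetical helpers; the
# paper does not specify this implementation.
from collections import Counter


def majority_vote_proxy_rewards(answers: list[str]) -> list[float]:
    """Reward 1.0 for answers matching the most common answer, else 0.0."""
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def reward_saturated(reward_history: list[float],
                     window: int = 50,
                     epsilon: float = 0.01) -> bool:
    """Flag saturation when mean training reward stops improving.

    Compares the mean reward over the last `window` steps against the
    preceding `window`; improvement below `epsilon` counts as saturated.
    Both parameters are illustrative, not taken from the paper.
    """
    if len(reward_history) < 2 * window:
        return False
    recent = sum(reward_history[-window:]) / window
    previous = sum(reward_history[-2 * window:-window]) / window
    return recent - previous < epsilon


# Usage: score a batch of sampled answers to one prompt, then track the
# running mean reward to see how long the pre-saturation phase lasts.
samples = ["42", "42", "17", "42"]
print(majority_vote_proxy_rewards(samples))  # [1.0, 1.0, 0.0, 1.0]
```

Under the paper's framing, a model whose mean proxy reward climbs slowly alongside held-out accuracy would sit in the generalizing regime, while one that trips the saturation check early is likely memorizing.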