When Can LLMs Learn to Reason with Weak Supervision?
April 20, 2026
Authors: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov
cs.AI
Abstract
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.