Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
January 30, 2026
Authors: Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, Hyunwoo Kim, Prithviraj Ammanabrolu, Jan Kautz, Yi Dong, Yejin Choi
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet scaling up RL is bottlenecked by the limited supply of existing verifiable data, and improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
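To make the construction concrete, here is a minimal Python sketch of the conversion the abstract describes: mask a key reasoning step, ask an LLM for plausible distractors, and emit a multiple-choice task whose answer is verifiable by construction. The prompt wording, the JSON schema, the generic `llm` callable, and the names `GooseTask`, `synthesize_task`, and `reward` are illustrative assumptions, not the paper's released pipeline.

```python
import json
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GooseTask:
    question: str        # source passage with one reasoning step masked
    choices: List[str]   # gold step plus distractors, shuffled
    answer: str          # letter of the gold choice, e.g. "B"

# Hypothetical prompts; the paper's actual prompts may differ.
MASK_PROMPT = (
    "Identify the single most important reasoning step in the text below. "
    "Return a JSON object with keys 'masked_text' (the text with that step "
    "replaced by [MASK]) and 'gold_step' (the removed step).\n\nTEXT:\n{text}"
)

DISTRACTOR_PROMPT = (
    "Given a passage containing [MASK] and the true missing step, write three "
    "diverse but plausible INCORRECT steps as a JSON list of strings.\n\n"
    "PASSAGE:\n{masked}\n\nTRUE STEP:\n{gold}"
)

def synthesize_task(text: str, llm: Callable[[str], str],
                    rng: random.Random) -> GooseTask:
    """Turn one unverifiable passage into a verifiable multiple-choice task."""
    masked = json.loads(llm(MASK_PROMPT.format(text=text)))
    distractors = json.loads(llm(DISTRACTOR_PROMPT.format(
        masked=masked["masked_text"], gold=masked["gold_step"])))
    choices = [masked["gold_step"]] + list(distractors)
    rng.shuffle(choices)
    gold_letter = "ABCD"[choices.index(masked["gold_step"])]
    return GooseTask(question=masked["masked_text"],
                     choices=choices, answer=gold_letter)

def reward(task: GooseTask, model_answer: str) -> float:
    """Verifiable RLVR reward: exact match against the known gold letter."""
    return 1.0 if model_answer.strip().upper() == task.answer else 0.0
```

Because the synthesizer knows the gold choice, the reward reduces to an exact-match check, which is what lets otherwise unverifiable text feed standard RLVR training; the three-distractor, four-option format above is a guess for illustration.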