Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

January 30, 2026
Authors: Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, Hyunwoo Kim, Prithviraj Ammanabrolu, Jan Kautz, Yi Dong, Yejin Choi
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet scaling up RL is bottlenecked by the limited supply of existing verifiable data, and improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich but unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
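
To make the construction concrete, here is a minimal Python sketch of the pipeline the abstract describes. It is not the authors' released code: the `call_llm` helper, the prompt wording, the `[MASK]` token, and the `RLVRTask` layout are all assumptions; only the overall recipe (mask a key reasoning step, generate plausible distractors, keep the correct option's position as the verifiable label) comes from the paper.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class RLVRTask:
    """A multiple-choice fill-in-the-middle task with a checkable answer."""
    masked_text: str    # source passage with one reasoning step replaced by [MASK]
    options: list[str]  # shuffled candidates: the original step plus distractors
    answer_index: int   # position of the original step; the verifiable reward label


def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; wire this to any chat-completion API."""
    raise NotImplementedError


def synthesize_task(passage: str, num_distractors: int = 3) -> RLVRTask:
    # 1) Ask the LLM to identify a key reasoning step in the passage.
    step = call_llm(
        "Quote verbatim one sentence from this passage that carries a key "
        f"reasoning step:\n\n{passage}"
    ).strip()
    if step not in passage:
        raise ValueError("LLM did not quote the step verbatim; retry or skip.")

    # 2) Mask that step to form a fill-in-the-middle question.
    masked_text = passage.replace(step, "[MASK]", 1)

    # 3) Ask the LLM for diverse, plausible-but-wrong alternatives.
    distractors = json.loads(call_llm(
        f"Passage with a masked step:\n\n{masked_text}\n\n"
        f"The correct step is: {step}\n"
        f"Return a JSON list of {num_distractors} plausible but incorrect "
        "alternative steps."
    ))

    # 4) Shuffle the options; the correct option's index is the reward label.
    options = [step, *distractors]
    random.shuffle(options)
    return RLVRTask(masked_text, options, options.index(step))
```

At RL time the reward is then trivially checkable: the policy's chosen option either matches `answer_index` or it does not, which is what turns otherwise unverifiable prose into RLVR training signal.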