Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
April 6, 2026
Authors: Justin Chih-Yao Chen, Archiki Prasad, Zaid Khan, Joykirat Singh, Runchu Tian, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, as these yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants -- such as multiple-choice and cloze formats -- that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT shows +4.72% (Qwen) and +3.23% (Llama) improvements over the second-best baseline. We further show that Cog-DRIFT improves pass@k at test time, and the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
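To make the recipe concrete, below is a minimal Python sketch of the two ideas the abstract describes: constructing multiple-choice and cloze variants that preserve the original answer while shrinking the search space, and ordering the variants into an easy-to-hard curriculum by empirical pass rate under the current policy. All names here (`reformulate`, `empirical_difficulty`, `build_curriculum`, `rollout`) are hypothetical illustrations under assumed interfaces, not the authors' released implementation.

```python
import random
import string

def reformulate(problem: str, answer: str, distractors: list[str]) -> dict[str, str]:
    """Build cognitively simpler variants of one open-ended problem.

    Each variant preserves the original answer; the multiple-choice and
    cloze forms reduce the effective search space and thus yield denser
    reward signals during RL post-training. (Illustrative sketch only.)
    """
    options = distractors + [answer]
    random.shuffle(options)
    mc_body = "\n".join(
        f"({string.ascii_uppercase[i]}) {opt}" for i, opt in enumerate(options)
    )
    return {
        "multiple_choice": f"{problem}\nChoose one:\n{mc_body}",
        "cloze": f"{problem}\nFill in the blank: the answer is ____.",
        "open_ended": problem,
    }

def empirical_difficulty(rollout, prompt: str, answer: str, k: int = 8) -> float:
    """Difficulty = 1 - pass rate over k sampled rollouts.

    `rollout` is any callable mapping a prompt to a model answer,
    a hypothetical stand-in for sampling from the current policy.
    """
    passes = sum(rollout(prompt) == answer for _ in range(k))
    return 1.0 - passes / k

def build_curriculum(variants: dict[str, str], rollout, answer: str):
    """Order variants easy-to-hard by measured difficulty, so training
    starts on formats the current policy can already get rewarded on."""
    scored = [
        (empirical_difficulty(rollout, prompt, answer), fmt, prompt)
        for fmt, prompt in variants.items()
    ]
    return [(fmt, prompt) for _, fmt, prompt in sorted(scored, key=lambda t: t[0])]

# Example (with a hypothetical model wrapper `my_model_answer`):
#   variants = reformulate("What is 17 * 23?", "391", ["361", "401", "317"])
#   curriculum = build_curriculum(variants, my_model_answer, answer="391")
```

Note that sorting by measured pass rate, rather than by a fixed format order, is what makes the curriculum adaptive: discriminative formats typically surface first because the policy solves them more often, and the original open-ended problem lands at the end of the progression.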