Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

Abstract

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, because these problems yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants, such as multiple-choice and cloze formats, that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute gains of +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across two models and six reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines, improving over the second-best baseline by +4.72% (Qwen) and +3.23% (Llama) on average. We further show that Cog-DRIFT improves pass@k at test time and that the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
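The abstract does not give implementation details, so the following is only a minimal Python sketch of the two ideas it describes, under stated assumptions. It shows (1) why a group of rollouts that all fail gives no learning signal when advantages are normalized within the group, as in GRPO, and (2) how a problem might be rewritten into easier formats and promoted along a format ladder. The names `Problem`, `FORMATS`, `reformulate`, and `next_format`, the half-masked cloze rule, the supplied distractors, and the `promote_at` threshold are all hypothetical illustrations, not the authors' method.

```python
import random
import statistics
from dataclasses import dataclass

# Hypothetical format ladder, ordered from most discriminative (easiest)
# to fully generative (hardest). The names are illustrative only.
FORMATS = ["multiple_choice", "cloze", "open_ended"]


@dataclass
class Problem:
    question: str
    answer: str
    distractors: list[str]  # assumed to be given; the abstract does not say how they are built


def group_advantages(rewards):
    """Group-normalized advantages in the style of GRPO: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every rollout received the same reward, so there is nothing to learn from.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def reformulate(problem, fmt, rng):
    """Build a prompt in the requested format; the gold answer itself is unchanged."""
    if fmt == "multiple_choice":
        options = problem.distractors + [problem.answer]
        rng.shuffle(options)
        lines = [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
        return problem.question + "\n" + "\n".join(lines)
    if fmt == "cloze":
        # Reveal the first half of the answer and mask the rest (an assumed masking rule).
        keep = len(problem.answer) // 2
        hint = problem.answer[:keep] + "_" * (len(problem.answer) - keep)
        return f"{problem.question}\nFill in the blanks: {hint}"
    return problem.question


def next_format(current, pass_rate, promote_at=0.5):
    """Move one step toward the open-ended format once the current one is mostly solved."""
    idx = FORMATS.index(current)
    if pass_rate >= promote_at and idx < len(FORMATS) - 1:
        return FORMATS[idx + 1]
    return current


if __name__ == "__main__":
    print(group_advantages([0, 0, 0, 0]))  # all rollouts fail: every advantage is zero
    print(group_advantages([1, 0, 0, 0]))  # one success: non-zero advantages appear
    p = Problem("What is 6 * 7?", "42", ["40", "44", "48"])
    print(reformulate(p, "multiple_choice", random.Random(0)))
    print(next_format("multiple_choice", pass_rate=0.75))
```

In this sketch, an open-ended problem whose rollouts all score zero contributes nothing to the update, whereas the multiple-choice variant of the same problem can produce mixed rewards and therefore non-zero advantages. A real curriculum would track pass rates per problem during training rather than take them as an argument.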