Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

Abstract

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, because these problems yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants (such as multiple-choice and cloze formats) that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute gains of +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT improves over the second-best baseline by +4.72% (Qwen) and +3.23% (Llama). We further show that Cog-DRIFT improves pass@k at test time and that the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
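
As a rough illustration of the exploration barrier described above, the following Python sketch shows why a group of rollouts that all fail gives no learning signal under a GRPO-style group-normalized advantage, and how an open-ended problem might be recast as a multiple-choice item that keeps the same answer. This is not the authors' implementation: the function names, the normalization constant `eps`, and the distractor list are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps).

    The exact normalization used in the paper is not stated in the abstract;
    this form is an assumption for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def to_multiple_choice(question, answer, distractors, rng):
    """Hypothetical reformulation: same answer, but the model only picks a letter.

    Assumes the distractors are distinct from the answer and from each other.
    """
    options = distractors + [answer]
    rng.shuffle(options)
    letters = "ABCDEFGH"[: len(options)]
    lines = [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    prompt = question + "\n" + "\n".join(lines)
    correct_letter = letters[options.index(answer)]
    return prompt, correct_letter


# Every rollout fails on a hard open-ended problem: all advantages are zero,
# so the policy gradient carries no information.
all_fail = group_advantages([0, 0, 0, 0])

# A single success in the group is enough to produce a nonzero signal.
one_success = group_advantages([0, 0, 1, 0])

# The multiple-choice variant keeps the answer but shrinks the search space.
prompt, correct_letter = to_multiple_choice(
    "What is 6 * 7?", "42", ["36", "48", "54"], random.Random(0)
)
```

With four options, even a policy that guesses uniformly is rewarded on about a quarter of rollouts, which is one way to read the abstract's claim that reformulated formats provide denser learning signals.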
