

Puzzle Curriculum GRPO for Vision-Centric Reasoning

December 16, 2025
作者: Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk
cs.AI

Abstract

Recent reinforcement learning (RL) approaches such as outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards), and Jigsaw (with graded partial credit that mitigates reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early and then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
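
The abstract does not give the paper's formulas, so the following is only a minimal sketch of how a graded Jigsaw reward and a difficulty-aware curriculum weight could combine with GRPO's group-relative advantage. The function names, the Gaussian weighting shape, and the use of mean group reward as a difficulty proxy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of PC-GRPO-style rewards and curriculum weighting.
import numpy as np

def jigsaw_reward(pred_perm, true_perm):
    """Graded partial credit: fraction of puzzle pieces placed correctly."""
    return float((np.asarray(pred_perm) == np.asarray(true_perm)).mean())

def curriculum_weight(success_rate, peak=0.5, width=0.25):
    """Difficulty-aware weight peaking at medium difficulty.
    Gaussian shape is an assumption; the paper's schedule may differ."""
    return float(np.exp(-((success_rate - peak) ** 2) / (2 * width ** 2)))

def weighted_group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages, rescaled by the curriculum
    weight so medium-difficulty samples contribute most to the update."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)      # group-relative advantage
    w = curriculum_weight(r.mean())             # mean reward as difficulty proxy
    return w * adv

# Example: a group of 4 rollouts on one Jigsaw sample with target order [0, 1, 2, 3].
rewards = [jigsaw_reward(p, [0, 1, 2, 3]) for p in
           ([0, 1, 2, 3], [0, 1, 3, 2], [3, 2, 1, 0], [0, 2, 1, 3])]
print(weighted_group_advantages(rewards))
```

In this sketch, a fully correct arrangement still earns the top reward, while partially correct arrangements receive graded credit instead of zero, and groups whose mean reward sits near the middle of the range are up-weighted relative to trivially easy or impossibly hard samples.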