Puzzle Curriculum GRPO for Vision-Centric Reasoning
December 16, 2025
Authors: Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk
cs.AI
Abstract
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards), and Jigsaw (with graded partial credit that mitigates reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples by difficulty, with weights peaking at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early and then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
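To make the graded-reward and curriculum ideas concrete, below is a minimal sketch in Python. It is not the paper's implementation: the helper names (jigsaw_reward, curriculum_weight, weighted_group_advantages), the fraction-of-correct-tiles form of partial credit, the Gaussian-shaped curriculum bump, and the use of group success rate as a difficulty proxy are all assumptions made for illustration of how graded rewards and difficulty-aware weighting can keep GRPO's group-relative advantages from vanishing.

```python
import numpy as np

def jigsaw_reward(predicted_perm, true_perm):
    """Graded partial credit for the Jigsaw puzzle (assumed form):
    fraction of tiles placed in their correct positions, rather than
    an all-or-nothing binary reward."""
    predicted_perm = np.asarray(predicted_perm)
    true_perm = np.asarray(true_perm)
    return float((predicted_perm == true_perm).mean())

def curriculum_weight(difficulty, peak=0.5, width=0.2):
    """Difficulty-aware sample weight peaking at medium difficulty.
    `difficulty` lies in [0, 1]; the Gaussian bump is an assumed shape."""
    return float(np.exp(-((difficulty - peak) ** 2) / (2 * width ** 2)))

def weighted_group_advantages(rewards, difficulty):
    """GRPO-style group-relative advantages scaled by the curriculum weight.
    With flat (identical) rewards the advantages vanish; graded partial
    credit and curriculum weighting are meant to mitigate exactly this."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:  # flat group: no learning signal
        return np.zeros_like(rewards)
    adv = (rewards - rewards.mean()) / std
    return curriculum_weight(difficulty) * adv

# Toy usage: a group of 4 rollouts for one jigsaw sample with 4 tiles.
true_perm = [0, 1, 2, 3]
rollouts = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 1, 3, 2], [3, 2, 1, 0]]
rewards = [jigsaw_reward(p, true_perm) for p in rollouts]
difficulty = 1.0 - np.mean(rewards)  # harder if the group succeeds less often
print(weighted_group_advantages(rewards, difficulty))
```

In this toy group the rewards are [1.0, 0.5, 0.5, 0.0], so the advantages are non-zero even though no rollout is judged purely right or wrong; with a binary reward the same group could easily be all-zero or all-one and contribute no gradient.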