Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
May 29, 2025
Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
cs.AI
Abstract
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: Firstly, we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Secondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Thirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering; consequently, even when trained for step-by-step reasoning, they may ignore the thinking process when deriving the final answer. Fourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece to the larger jigsaw puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
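The abstract frames jigsaw puzzles as a task with inherent, verifiable ground truth. As an illustration only (not the authors' implementation), the Python sketch below shows one plausible way such a task and its rule-based reward could be set up: an image is split into a grid of tiles, the tiles are shuffled, and a reward of 1.0 is given only when the full permutation is recovered. The grid size, function names, and the all-or-nothing reward scheme are assumptions made for this sketch.

```python
# Illustrative sketch only: one plausible construction of a jigsaw task with a
# rule-based reward. Grid size, function names, and the binary reward scheme
# are assumptions, not the paper's implementation.
import random
from PIL import Image


def make_jigsaw(image: Image.Image, rows: int = 2, cols: int = 2):
    """Split an image into a rows x cols grid, shuffle the tiles, and return
    the shuffled tiles together with the ground-truth permutation."""
    w, h = image.size
    tile_w, tile_h = w // cols, h // rows
    tiles = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows)
        for c in range(cols)
    ]
    order = list(range(len(tiles)))
    random.shuffle(order)
    shuffled = [tiles[i] for i in order]
    return shuffled, order  # the model must recover `order` from `shuffled`


def rule_based_reward(predicted: list[int], target: list[int]) -> float:
    """All-or-nothing rule-based reward: 1.0 only if the permutation is
    recovered exactly; a partial-credit variant could score per-tile matches."""
    return 1.0 if predicted == target else 0.0
```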