

Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

May 29, 2025
Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
cs.AI

Abstract

The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth and adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: First, MLLMs, initially performing near random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy through fine-tuning and generalize to complex, unseen configurations. Second, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Third, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering; consequently, even when trained for step-by-step reasoning, they can ignore the thinking process when deriving the final answer. Fourth, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing with training and task difficulty. Finally, our results demonstrate that RL generalizes more effectively than supervised fine-tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable jigsaw piece to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
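
To make the experimental setup concrete, the following is a minimal Python sketch of what a jigsaw task instance and its rule-based reward could look like: an image is split into a grid of patches, the patches are shuffled, and the model's predicted permutation is verified against the ground truth. This is an illustrative assumption rather than the paper's implementation; the function names (make_puzzle, rule_based_reward, partial_reward) and the exact reward scheme are hypothetical, and the authors' actual code lives in the linked repository.

```python
# Minimal sketch (an assumption, not the authors' implementation) of a
# jigsaw task instance and a rule-based reward. The patches of an image
# are shuffled over a rows x cols grid, and the model must output the
# permutation restoring the original order. The reward needs no learned
# model: it is a fixed rule checked against the inherent ground truth.
import random

def make_puzzle(rows: int, cols: int, seed: int = 0) -> list[int]:
    """Return the ground-truth permutation of a shuffled rows x cols puzzle.

    ground_truth[i] = original index of the patch shown in slot i.
    Difficulty is adjustable via the grid size.
    """
    order = list(range(rows * cols))
    random.Random(seed).shuffle(order)
    return order

def rule_based_reward(predicted: list[int], ground_truth: list[int]) -> float:
    """Binary rule: 1.0 for an exact reconstruction, else 0.0."""
    return 1.0 if predicted == ground_truth else 0.0

def partial_reward(predicted: list[int], ground_truth: list[int]) -> float:
    """Denser variant: fraction of patches placed correctly."""
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# Example: a 2 x 2 puzzle, i.e. the simplest configuration.
gt = make_puzzle(2, 2, seed=42)
print(rule_based_reward(gt, gt))         # 1.0 -- exact match
print(partial_reward([0, 1, 2, 3], gt))  # partial credit for a guess
```

The binary exact-match rule mirrors the verifiable-reward idea central to rule-based RL, while a per-patch variant like partial_reward would give denser feedback as the grid size, and hence the difficulty, grows.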

