Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Abstract

The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and a demand for complex decision-making, making them ideal for this study. Our research reveals several key findings. First, MLLMs that initially perform close to random guessing on the simplest jigsaw puzzles achieve near-perfect accuracy after fine-tuning and generalize to complex, unseen configurations. Second, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Third, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourth, complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL generalizes more effectively than supervised fine-tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece to the larger puzzle of our collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
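
To make concrete why jigsaw puzzles come with "inherent ground truth" and "adjustable difficulty", the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes an image represented as a plain 2D list, a hypothetical helper `make_jigsaw` that cuts and shuffles tiles, and a hypothetical binary `rule_based_reward`; the reward design actually used in the paper may differ.

```python
import random


def make_jigsaw(image, rows, cols, seed=0):
    """Split a 2D list `image` into rows*cols tiles and shuffle them.

    Returns (shuffled_tiles, permutation), where permutation[i] is the
    original index of the tile now shown at position i. The permutation
    is the ground truth and comes for free from the construction.
    """
    h, w = len(image), len(image[0])
    if h % rows or w % cols:
        raise ValueError("image size must be divisible by the grid size")
    th, tw = h // rows, w // cols
    # Tiles are indexed in row-major order over the grid.
    tiles = [
        [row[c * tw:(c + 1) * tw] for row in image[r * th:(r + 1) * th]]
        for r in range(rows)
        for c in range(cols)
    ]
    permutation = list(range(rows * cols))
    random.Random(seed).shuffle(permutation)
    shuffled = [tiles[i] for i in permutation]
    return shuffled, permutation


def rule_based_reward(predicted, permutation):
    """Hypothetical binary reward: 1.0 for an exact match, else 0.0."""
    if sorted(predicted) != list(range(len(permutation))):
        return 0.0  # malformed answer: not a permutation of tile indices
    return 1.0 if list(predicted) == list(permutation) else 0.0


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    tiles, truth = make_jigsaw(image, rows=2, cols=2, seed=1)
    print(rule_based_reward(truth, truth))
```

In this sketch, difficulty is controlled only by the grid size: a 2x2 grid has 24 possible arrangements, while a 3x3 grid has 362,880. The reward is checked by a fixed rule against the known permutation, so no learned reward model or human label is needed.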
