Visual Jigsaw Post-Training Improves MLLMs

Abstract

Reinforcement-learning-based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). Although vision-centric post-training is crucial for strengthening MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric: dense visual inputs are used only to extract sparse cues for text-based reasoning. A few approaches exist in this direction, but they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This formulation aligns naturally with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically, without any annotations. We instantiate Visual Jigsaw across three visual modalities: images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/
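The ordering task described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`make_jigsaw`, `parse_permutation`, `reward`), the 1-based comma-separated answer format, and the exact-match reward are all assumptions made for illustration, since the abstract does not specify these details. Pieces here are arbitrary objects standing in for image patches, video clips, or 3D samples.

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_jigsaw(pieces: Sequence, rng: random.Random) -> Tuple[list, List[int]]:
    """Shuffle the pieces and return (shuffled_pieces, answer).

    answer[k] is the 1-based position, in the shuffled list, of the piece
    that belongs at original position k. Reading the shuffled pieces in
    `answer` order restores the original input. No annotation is needed:
    the label comes from the shuffle itself.
    """
    order = list(range(len(pieces)))
    rng.shuffle(order)
    shuffled = [pieces[i] for i in order]  # shuffled[j] == pieces[order[j]]
    answer = [0] * len(pieces)
    for j, i in enumerate(order):
        answer[i] = j + 1
    return shuffled, answer


def parse_permutation(text: str, n: int) -> Optional[List[int]]:
    """Parse output such as '3, 1, 2'; return None unless it is a permutation of 1..n."""
    try:
        values = [int(token) for token in text.replace(",", " ").split()]
    except ValueError:
        return None
    return values if sorted(values) == list(range(1, n + 1)) else None


def reward(text: str, answer: List[int]) -> float:
    """Assumed verifiable reward: 1.0 for an exact match, 0.0 otherwise."""
    predicted = parse_permutation(text, len(answer))
    if predicted is None:
        return 0.0  # malformed or invalid permutation
    return 1.0 if predicted == answer else 0.0
```

Because the correct permutation is known from the shuffle, the reward can be checked programmatically, which is what makes the task a natural fit for RLVR. A graded reward (for example, the fraction of correctly placed pieces) would be an equally plausible choice under the same assumptions.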
