Visual Jigsaw Post-Training Improves MLLMs
September 29, 2025
Authors: Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu
cs.AI
Abstract
Reinforcement learning based post-training has recently emerged as a powerful
paradigm for enhancing the alignment and reasoning capabilities of multimodal
large language models (MLLMs). While vision-centric post-training is crucial
for enhancing MLLMs' intrinsic understanding of visual signals, current
post-training paradigms are predominantly text-centric, where dense visual
inputs are only leveraged to extract sparse cues for text-based reasoning.
A few approaches have explored this direction; however, they often still rely
on text as an intermediary or introduce additional visual generative
designs. In this work, we introduce Visual Jigsaw, a generic self-supervised
post-training framework designed to strengthen visual understanding in MLLMs.
Visual Jigsaw is formulated as a general ordering task: visual inputs are
partitioned, shuffled, and the model must reconstruct the visual information by
producing the correct permutation in natural language. This naturally aligns
with reinforcement learning from verifiable rewards (RLVR), requires no
additional visual generative components, and derives its supervisory signal
automatically without any annotations. We instantiate Visual Jigsaw across
three visual modalities: images, videos, and 3D data. Extensive
experiments demonstrate substantial improvements in fine-grained perception,
temporal reasoning, and 3D spatial understanding. Our findings highlight the
potential of self-supervised vision-centric tasks in post-training MLLMs and
aim to inspire further research on vision-centric pretext designs. Project
Page: https://penghao-wu.github.io/visual_jigsaw/