Visual Jigsaw Post-Training Improves MLLMs
September 29, 2025
Authors: Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu
cs.AI
Abstract
Reinforcement learning based post-training has recently emerged as a powerful
paradigm for enhancing the alignment and reasoning capabilities of multimodal
large language models (MLLMs). While vision-centric post-training is crucial
for enhancing MLLMs' intrinsic understanding of visual signals, current
post-training paradigms are predominantly text-centric, where dense visual
inputs are only leveraged to extract sparse cues for text-based reasoning.
A few approaches have explored this direction; however, they often still rely
on text as an intermediate mediator or introduce additional visual generative
designs. In this work, we introduce Visual Jigsaw, a generic self-supervised
post-training framework designed to strengthen visual understanding in MLLMs.
Visual Jigsaw is formulated as a general ordering task: visual inputs are
partitioned and shuffled, and the model must reconstruct the visual information
by producing the correct permutation in natural language. This naturally aligns
with reinforcement learning from verifiable rewards (RLVR), requires no
additional visual generative components, and derives its supervisory signal
automatically without any annotations. We instantiate Visual Jigsaw across
three visual modalities: images, videos, and 3D data. Extensive
experiments demonstrate substantial improvements in fine-grained perception,
temporal reasoning, and 3D spatial understanding. Our findings highlight the
potential of self-supervised vision-centric tasks in post-training MLLMs and
aim to inspire further research on vision-centric pretext designs. Project
Page: https://penghao-wu.github.io/visual_jigsaw/
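
To make the task formulation concrete, the following is a minimal Python sketch of how an image-modality jigsaw example and its verifiable reward could be constructed: an image is split into a grid of patches, the patches are shuffled, and the reward checks the permutation the model outputs against the recorded ground truth. The function names, grid size, and exact-match reward here are illustrative assumptions, not the authors' implementation.

```python
import random
import numpy as np

def make_image_jigsaw(image, grid=2, seed=None):
    """Split an (H, W, C) image into grid x grid patches, shuffle them,
    and return the shuffled patches plus the ground-truth permutation."""
    rng = random.Random(seed)
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(grid) for j in range(grid)]
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = [patches[k] for k in order]
    # order[p] is the original index of the patch placed in shuffled slot p,
    # so the answer the model should produce (as text) is this permutation.
    return shuffled, order

def jigsaw_reward(predicted, answer):
    """Verifiable reward: 1.0 for an exact permutation match, else 0.0.
    (A partial-credit variant could score the fraction of correct slots.)"""
    return 1.0 if list(predicted) == list(answer) else 0.0

# Usage: build a task from a dummy 4x4 RGB image and score a textual answer
# such as "2 0 3 1" after parsing it into patch indices.
img = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
shuffled_patches, gold = make_image_jigsaw(img, grid=2, seed=0)
prediction = [int(tok) for tok in "2 0 3 1".split()]
print(jigsaw_reward(prediction, gold))
```

Because the reward only requires checking a short permutation string, no visual generation head or learned verifier is needed, which is the property the abstract highlights as making the task a natural fit for RLVR-style post-training.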