
Boosting Visual Instruction Tuning with Self-Supervised Guidance

April 14, 2026
Authors: Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris
cs.AI

Abstract
Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
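The abstract describes reformulating classical self-supervised pretext tasks as image-instruction-response triplets and mixing a small fraction (3-10%) of them into the instruction-tuning data. A minimal sketch of that data-construction idea for the rotation-prediction task is shown below; the `Triplet` structure, function names, and the metadata-only handling of rotation (the actual pixel rotation would be applied at image-loading time) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: a rotation-prediction SSL task recast as an
# image-instruction-response triplet, then mixed into a base
# instruction-tuning dataset at a small fraction (e.g. 3-10%).
import random
from dataclasses import dataclass

ROTATIONS = [0, 90, 180, 270]  # candidate rotation angles in degrees


@dataclass
class Triplet:
    image_id: str
    rotation: int     # rotation applied to the image (degrees, clockwise)
    instruction: str  # natural-language instruction shown to the MLLM
    response: str     # target answer, solvable only from visual evidence


def make_rotation_triplet(image_id: str, rng: random.Random) -> Triplet:
    """Sample a rotation and phrase the pretext task as an instruction."""
    angle = rng.choice(ROTATIONS)
    instruction = ("By how many degrees has this image been rotated "
                   "clockwise? Answer with 0, 90, 180, or 270.")
    return Triplet(image_id, angle, instruction, str(angle))


def mix_into_dataset(base, ssl_triplets, fraction=0.05, rng=None):
    """Inject a small fraction of SSL triplets into the tuning data."""
    rng = rng or random.Random(0)
    n = int(len(base) * fraction)
    return list(base) + rng.sample(ssl_triplets, min(n, len(ssl_triplets)))
```

Because the correct response is the sampled angle itself, the supervision cannot be answered from language priors alone, which is the property the abstract emphasizes.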