Boosting Visual Instruction Tuning with Self-Supervised Guidance

Abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervised tasks that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code is available at: https://github.com/sirkosophia/V-GIFT
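
To make the idea of an image-instruction-response triplet concrete, the following is a minimal sketch of how rotation prediction could be recast as an instruction sample and mixed into an existing instruction-tuning set. It is not the released V-GIFT code: the function names (`rotate90_ccw`, `make_rotation_triplet`, `mix_in`), the instruction wording, the nested-list image representation, and the reading of the 3-10% figure as a share of the final mixture are all assumptions made for illustration.

```python
import random

# Candidate rotation angles in degrees, counter-clockwise (assumed label set).
ROTATIONS = (0, 90, 180, 270)


def rotate90_ccw(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_triplet(image, rng):
    """Build one hypothetical image-instruction-response sample.

    The label comes from the transformation itself, so no human
    annotation is needed, and the answer cannot be guessed from the
    instruction text alone.
    """
    angle = rng.choice(ROTATIONS)
    rotated = image
    for _ in range(angle // 90):
        rotated = rotate90_ccw(rotated)
    instruction = (
        "The image has been rotated counter-clockwise by 0, 90, 180, "
        "or 270 degrees. Which rotation was applied? "
        "Answer with the number only."
    )
    return {"image": rotated, "instruction": instruction, "response": str(angle)}


def mix_in(base_samples, images, fraction, rng):
    """Add self-supervised triplets so they form about `fraction` of the result.

    Assumption: `fraction` is measured against the final mixed set.
    """
    n_ssl = round(fraction * len(base_samples) / (1.0 - fraction))
    ssl = [make_rotation_triplet(rng.choice(images), rng) for _ in range(n_ssl)]
    mixed = list(base_samples) + ssl
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    toy_image = [[1, 2], [3, 4]]  # stand-in for real pixel data
    base = [{"image": toy_image, "instruction": "Describe the image.",
             "response": "A toy grid."}] * 95
    mixed = mix_in(base, [toy_image], fraction=0.05, rng=rng)
    print(len(mixed), "samples in the mixed set")
```

In a real pipeline the nested lists would be replaced by actual images and the samples written in whatever format the instruction-tuning framework expects; the point of the sketch is only that the supervision is derived from the image transformation and added purely by changing the data mixture.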
