MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Abstract

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which are particularly hard to produce when the questions involve fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine whether they are the same or different). We then optimize the model with rule-based reinforcement learning. Because the images are highly similar and augmentations are present, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
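The triplet construction and the rule-based reward described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy nested-list images, the `hflip` and `brighten` augmentations, the pairwise framing of the comparison, the decision to leave the third image unaugmented, and the `<answer>...</answer>` output format are all assumptions made for illustration, since the abstract specifies none of them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumption)


def hflip(img: Image) -> Image:
    """Toy augmentation (hypothetical): mirror each row."""
    return [row[::-1] for row in img]


def brighten(img: Image, delta: int = 10) -> Image:
    """Toy augmentation (hypothetical): add a constant, clipped to [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


@dataclass
class ComparisonSample:
    images: List[Image]  # the two images shown to the model
    label: str           # ground truth derived from construction: "same" or "different"


def build_triplet_samples(
    anchor: Image,
    similar: Image,
    augment_a: Callable[[Image], Image],
    augment_b: Callable[[Image], Image],
) -> List[ComparisonSample]:
    """Turn one (anchor, similar-but-distinct) image pair into labeled comparisons.

    The labels come for free from how the triplet is built, so no
    human-written question-answer pair is needed.
    """
    view_a = augment_a(anchor)  # first augmented view of the anchor
    view_b = augment_b(anchor)  # second augmented view of the same anchor
    distinct = similar          # third image, left unaugmented here (assumption)
    return [
        ComparisonSample([view_a, view_b], "same"),
        ComparisonSample([view_a, distinct], "different"),
        ComparisonSample([view_b, distinct], "different"),
    ]


# Assumed output format: the model ends its reasoning with <answer>same</answer>
# or <answer>different</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(same|different)\s*</answer>", re.IGNORECASE)


def rule_based_reward(response: str, label: str) -> float:
    """Return 1.0 if the parsed answer matches the label, otherwise 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if match.group(1).lower() == label else 0.0


if __name__ == "__main__":
    anchor = [[0, 50], [100, 250]]
    similar = [[0, 50], [100, 200]]  # differs from the anchor in one pixel
    for sample in build_triplet_samples(anchor, similar, hflip, brighten):
        print(sample.label, sample.images)
```

In a reinforcement learning loop, a reward of this kind would score each sampled reasoning trace. The point of the sketch is only that the same/different label is a by-product of the triplet construction rather than of human annotation.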
