

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

June 27, 2025
作者: Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
cs.AI

Abstract

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
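The triplet construction and rule-based reward described in the abstract can be sketched as follows. This is a hypothetical, simplified illustration, not the authors' code: `make_triplet` and `rule_based_reward` are invented names, the augmentation function is left abstract, and the "similar but distinct" third image is drawn at random here, whereas a real pipeline would presumably retrieve a visually similar distractor (e.g., a nearest neighbor in embedding space).

```python
import random

def make_triplet(images, augment, rng=random):
    """Build one training triplet: two augmented views of an anchor image
    plus an augmented view of a distinct image.

    Hypothetical sketch: `images` is any sequence of image objects and
    `augment` any view-producing transform; here the distractor is sampled
    at random rather than retrieved by visual similarity.
    """
    anchor, distractor = rng.sample(list(images), 2)
    view_a, view_b = augment(anchor), augment(anchor)
    return view_a, view_b, augment(distractor)

def rule_based_reward(response: str, label: str) -> float:
    """Rule-based reward for the same/different comparison task:
    1.0 if the model's final answer matches the ground-truth label,
    else 0.0 (answer extraction from the CoT is assumed to happen
    upstream of this check)."""
    return 1.0 if response.strip().lower() == label.strip().lower() else 0.0
```

Because the labels come from the triplet construction itself (views of the anchor are "same", the distractor is "different"), no human-annotated question-answer pairs are needed.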