MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Abstract

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which are particularly difficult to construct when the questions involve fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process that compares these images (i.e., determines whether they are the same or different). We then optimize the model with rule-based reinforcement learning. Because the images are highly similar and have been augmented, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
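The triplet construction and the rule-based reward described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: images are toy nested lists, the augmentation is a flip plus a brightness shift, the triplet is split into pairwise comparisons, and the `<answer>` tag format and the names `augment`, `build_pairs`, and `rule_based_reward` are hypothetical.

```python
import random
import re
from dataclasses import dataclass
from typing import List

# Toy stand-in for an image: rows of grayscale pixel values (assumption).
Image = List[List[int]]


def augment(img: Image, rng: random.Random) -> Image:
    """Hypothetical augmentation: random horizontal flip plus a small brightness shift."""
    flipped = rng.random() < 0.5
    out = [row[::-1] if flipped else row[:] for row in img]
    shift = rng.randint(-5, 5)
    return [[min(255, max(0, p + shift)) for p in row] for row in out]


@dataclass
class Pair:
    left: Image
    right: Image
    label: str  # "same" or "different"; derived from construction, not annotation


def build_pairs(anchor: Image, similar: Image, rng: random.Random) -> List[Pair]:
    """Turn a triplet (two views of `anchor`, one view of `similar`) into labeled comparisons."""
    view_a = augment(anchor, rng)
    view_b = augment(anchor, rng)
    view_c = augment(similar, rng)
    return [
        Pair(view_a, view_b, "same"),       # two augmented views of one image
        Pair(view_a, view_c, "different"),  # similar but distinct image
        Pair(view_b, view_c, "different"),
    ]


# Assumed output format: the model ends its reasoning with <answer>same</answer>
# or <answer>different</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(same|different)\s*</answer>", re.IGNORECASE)


def rule_based_reward(response: str, label: str) -> float:
    """Return 1.0 when the parsed answer matches the label, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if match.group(1).lower() == label else 0.0
```

The labels come for free from how the pairs are built, which is the sense in which no human-annotated question-answer pairs are needed. A real pipeline would replace the toy augmentation with image transforms and feed the reward to a reinforcement learning optimizer.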
