MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
June 27, 2025
Authors: Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
cs.AI
Abstract
This work explores enabling Chain-of-Thought (CoT) reasoning to link visual
cues across multiple images. A straightforward solution is to adapt rule-based
reinforcement learning for Vision-Language Models (VLMs). However, such methods
typically rely on manually curated question-answer pairs, which can be
particularly challenging when dealing with fine-grained visual details and
complex logic across images. Inspired by self-supervised visual representation
learning, we observe that images contain inherent constraints that can serve as
supervision. Based on this insight, we construct image triplets comprising two
augmented views of the same image and a third, similar but distinct image.
During training, the model is prompted to generate a reasoning process to
compare these images (i.e., determine same or different). We then optimize the
model with rule-based reinforcement learning. Due to the high visual similarity
and the presence of augmentations, the model must attend to subtle visual
changes and perform logical reasoning to succeed. Experiments show that,
although trained solely on visual comparison tasks, the learned reasoning
ability generalizes effectively to a wide range of questions. Without relying
on any human-annotated question-answer pairs, our method achieves significant
improvements on multi-image reasoning benchmarks and shows strong performance
on general vision tasks.
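
The triplet construction and rule-based reward described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy augmentation (flip plus one-pixel crop on nested lists), the `<answer>` tag format, the binary 0/1 reward values, and all function names are assumptions made for illustration only.

```python
import random
import re


def augment(image, rng):
    """Toy augmentation (assumed): random horizontal flip plus a one-pixel crop."""
    if rng.random() < 0.5:
        rows = [row[::-1] for row in image]
    else:
        rows = [row[:] for row in image]
    top = rng.randint(0, 1)
    left = rng.randint(0, 1)
    height, width = len(rows), len(rows[0])
    cropped = rows[top:top + height - 1]
    return [row[left:left + width - 1] for row in cropped]


def build_triplet(image, similar_image, rng):
    """Two augmented views of `image` plus one view of a similar but distinct image.

    Returns the three images in shuffled order together with the source id
    (0 or 1) of each, so pairwise labels come from the images themselves
    rather than from human annotation.
    """
    items = [
        (augment(image, rng), 0),
        (augment(image, rng), 0),
        (augment(similar_image, rng), 1),
    ]
    rng.shuffle(items)
    images = [img for img, _ in items]
    sources = [src for _, src in items]
    return images, sources


def pair_label(sources, i, j):
    """Ground-truth comparison label for images i and j of a triplet."""
    return "same" if sources[i] == sources[j] else "different"


# Hypothetical output format: the final verdict is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*(same|different)\s*</answer>", re.IGNORECASE)


def rule_based_reward(response, label):
    """Return 1.0 if the parsed verdict matches the label, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if match.group(1).lower() == label else 0.0
```

In a real pipeline the images would be actual pictures with stronger augmentations, and the reward would feed a policy-gradient style optimizer. The sketch only shows that the supervision signal can be derived without any question-answer annotation.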