Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

May 20, 2025
Authors: Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, Kaiyang Zhou
cs.AI

Abstract

Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
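To make the caption-reason-answer format concrete, the following is a minimal sketch of how a rule-based reward for such outputs might be computed, together with the group-relative advantage normalization used in GRPO. It is an illustration under stated assumptions, not the authors' implementation: the tag names `<caption>`, `<reason>`, and `<answer>`, the helper names, and the reward values (1.0 for format, 1.0 for a correct answer) are all hypothetical, and the abstract does not say how the reward is actually defined.

```python
import re
from statistics import mean, pstdev

# Hypothetical markup: the abstract names the three stages but not the tags.
OUTPUT_PATTERN = re.compile(
    r"^\s*<caption>(?P<caption>.+?)</caption>\s*"
    r"<reason>(?P<reason>.+?)</reason>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)


def parse_output(text):
    """Return the three sections as a dict, or None if the format is violated."""
    match = OUTPUT_PATTERN.match(text)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def reward(text, gold_answer):
    """Assumed reward: 1.0 for following the format plus 1.0 for a correct answer."""
    parsed = parse_output(text)
    if parsed is None:
        return 0.0  # caption, reasoning, or answer is missing or out of order
    is_correct = parsed["answer"].lower() == gold_answer.strip().lower()
    return 1.0 + (1.0 if is_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    samples = [
        "<caption>A red cube sits on a table.</caption>"
        "<reason>The only object is a cube, and it is red.</reason>"
        "<answer>red</answer>",
        "<reason>Probably blue.</reason><answer>blue</answer>",  # skips the caption
    ]
    rewards = [reward(s, "red") for s in samples]
    print(rewards, group_relative_advantages(rewards))
```

In this sketch, a response that skips the caption earns no reward at all, which is one simple way to discourage answering without first describing the image. A check like this only verifies that a caption is present, not that it is detailed or faithful to the image.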
