

GRIT: Teaching MLLMs to Think with Images

May 21, 2025
作者: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang
cs.AI

Abstract

Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
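The abstract describes two concrete mechanisms: reasoning chains that interleave natural language with explicit bounding-box coordinates, and a GRPO-GR reward built only from final-answer accuracy and the format of the grounded reasoning output. Below is a minimal, illustrative sketch of such a reward, not the authors' implementation: the `[x1, y1, x2, y2]` box notation, the exact-match accuracy term, and the weighting are all assumptions made for the example.

```python
import re

# Illustrative sketch of a GRPO-GR-style reward as described in the abstract:
# one term for grounded-reasoning format, one for final-answer accuracy.
# Box notation, matching rule, and weights are assumptions, not the paper's spec.

BOX_PATTERN = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")

def format_reward(reasoning: str) -> float:
    """1.0 if the reasoning chain interleaves natural language with at least
    one explicit bounding-box coordinate, else 0.0."""
    has_box = bool(BOX_PATTERN.search(reasoning))
    has_text = bool(re.search(r"[A-Za-z]{3,}", reasoning))
    return 1.0 if (has_box and has_text) else 0.0

def accuracy_reward(predicted: str, gold: str) -> float:
    """Exact-match accuracy on the final answer (a simplifying assumption)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def grpo_gr_reward(reasoning: str, predicted: str, gold: str,
                   w_format: float = 0.5, w_accuracy: float = 0.5) -> float:
    """Weighted combination of the two reward terms; weights are hypothetical."""
    return w_format * format_reward(reasoning) + w_accuracy * accuracy_reward(predicted, gold)

# Example: a reasoning chain that grounds a claim in an image region.
chain = "The sign near the door [34, 120, 210, 180] reads 'EXIT', so the answer is exit."
print(grpo_gr_reward(chain, "exit", "Exit"))  # 1.0
```

Because both terms can be computed from the model's own output and the ground-truth answer alone, a reward of this shape requires no reasoning-chain annotations or bounding-box labels, which is the data-efficiency property the abstract highlights.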