GRIT: Teaching MLLMs to Think with Images
May 21, 2025
Authors: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang
cs.AI
Abstract
Recent studies have demonstrated the efficacy of using Reinforcement Learning
(RL) in building reasoning models that articulate chains of thoughts prior to
producing final answers. However, despite ongoing advances that aim at enabling
reasoning for vision-language tasks, existing open-source visual reasoning
models typically generate reasoning content with pure natural language, lacking
explicit integration of visual information. This limits their ability to
produce clearly articulated and visually grounded reasoning chains. To this
end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method
for training MLLMs to think with images. GRIT introduces a grounded reasoning
paradigm, in which models generate reasoning chains that interleave natural
language and explicit bounding box coordinates. These coordinates point to
regions of the input image that the model consults during its reasoning
process. Additionally, GRIT is equipped with a reinforcement learning approach,
GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused
on the final answer accuracy and format of the grounded reasoning output, which
eliminates the need for data with reasoning chain annotations or explicit
bounding box labels. As a result, GRIT achieves exceptional data efficiency,
requiring as few as 20 image-question-answer triplets from existing datasets.
Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to
produce coherent and visually grounded reasoning chains, showing a successful
unification of reasoning and grounding abilities.
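The grounded reasoning paradigm and the GRPO-GR rewards described above can be illustrated with a minimal sketch. The box-tag syntax (`<box>(x1, y1, x2, y2)</box>`), the `Answer:` marker, and the equal reward weighting are all assumptions for illustration; the paper's actual output format and reward design may differ.

```python
import re

# Hypothetical grounded-reasoning format: natural language interleaved with
# <box>(x1, y1, x2, y2)</box> tags, followed by an "Answer:" line.
BOX_PATTERN = re.compile(r"<box>\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)</box>")

def format_reward(output: str) -> float:
    """Reward well-formed grounded reasoning: at least one bounding box
    whose coordinates form a proper region (x1 < x2 and y1 < y2)."""
    boxes = [tuple(map(int, m)) for m in BOX_PATTERN.findall(output)]
    if not boxes:
        return 0.0
    valid = all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in boxes)
    return 1.0 if valid else 0.5

def answer_reward(output: str, gold: str) -> float:
    """Reward final-answer accuracy only; no chain-of-thought annotations
    or ground-truth box labels are required, as in GRPO-GR."""
    m = re.search(r"Answer:\s*(.+)", output)
    return 1.0 if m and m.group(1).strip().lower() == gold.strip().lower() else 0.0

def grpo_gr_reward(output: str, gold: str) -> float:
    # Equal weighting of the two reward terms is an assumption, not the
    # paper's specification.
    return 0.5 * format_reward(output) + 0.5 * answer_reward(output, gold)
```

Because both reward terms depend only on the output string and the gold answer, a reward of this shape can supervise RL rollouts using image-question-answer triplets alone, consistent with the data efficiency claim in the abstract.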