GRIT: Teaching MLLMs to Think with Images
May 21, 2025
Authors: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang
cs.AI
Abstract
Recent studies have demonstrated the efficacy of using Reinforcement Learning
(RL) in building reasoning models that articulate chains of thoughts prior to
producing final answers. However, despite ongoing advances that aim at enabling
reasoning for vision-language tasks, existing open-source visual reasoning
models typically generate reasoning content with pure natural language, lacking
explicit integration of visual information. This limits their ability to
produce clearly articulated and visually grounded reasoning chains. To this
end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method
for training MLLMs to think with images. GRIT introduces a grounded reasoning
paradigm, in which models generate reasoning chains that interleave natural
language and explicit bounding box coordinates. These coordinates point to
regions of the input image that the model consults during its reasoning
process. Additionally, GRIT is equipped with a reinforcement learning approach,
GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused
on the final answer accuracy and format of the grounded reasoning output, which
eliminates the need for data with reasoning chain annotations or explicit
bounding box labels. As a result, GRIT achieves exceptional data efficiency,
requiring as few as 20 image-question-answer triplets from existing datasets.
Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to
produce coherent and visually grounded reasoning chains, showing a successful
unification of reasoning and grounding abilities.
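The grounded reasoning paradigm and the GRPO-GR rewards described above can be illustrated with a minimal sketch. The box-tag syntax (`<box>(x1, y1, x2, y2)</box>`), the `Answer:` marker, and the equal reward weighting are all assumptions for illustration; the paper's actual output format and reward design may differ.

```python
import re

# Hypothetical grounded-reasoning format: natural language interleaved with
# <box>(x1, y1, x2, y2)</box> tags, followed by an "Answer:" line.
BOX_PATTERN = re.compile(r"<box>\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)</box>")

def format_reward(output: str) -> float:
    """Reward well-formed grounded reasoning: at least one bounding box
    whose coordinates form a proper region (x1 < x2 and y1 < y2)."""
    boxes = [tuple(map(int, m)) for m in BOX_PATTERN.findall(output)]
    if not boxes:
        return 0.0
    valid = all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in boxes)
    return 1.0 if valid else 0.5

def answer_reward(output: str, gold: str) -> float:
    """Reward final-answer accuracy only; no chain-of-thought annotations
    or ground-truth box labels are required, as in GRPO-GR."""
    m = re.search(r"Answer:\s*(.+)", output)
    return 1.0 if m and m.group(1).strip().lower() == gold.strip().lower() else 0.0

def grpo_gr_reward(output: str, gold: str) -> float:
    # Equal weighting of the two reward terms is an assumption, not the
    # paper's specification.
    return 0.5 * format_reward(output) + 0.5 * answer_reward(output, gold)
```

Because both reward terms depend only on the output string and the gold answer, a reward of this shape can supervise RL rollouts using image-question-answer triplets alone, consistent with the data efficiency claim in the abstract.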