OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

March 21, 2025
Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
cs.AI

Abstract

Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through reinforcement learning (RL) with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and RL to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps from high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating a refined SFT dataset for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.
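
The abstract describes a distill-then-iterate training loop: seed SFT data is distilled from a text-only R1 model via image captions, then SFT and RL alternate, with each RL-improved model regenerating the SFT data for the next round. Below is a minimal Python sketch of that control flow under stated assumptions; the helper names (distill_reasoning, sft, rl_verifiable) and their stub bodies are illustrative placeholders, not the authors' implementation, which lives in the linked repository.

```python
from typing import Callable, List

# A "model" here is just prompt-in, text-out; a real LVLM/LLM would wrap
# inference and training frameworks. All names below are hypothetical.
Model = Callable[[str], str]

def distill_reasoning(teacher: Model, captions: List[str]) -> List[str]:
    """Prompt a reasoning model with image captions to collect step-by-step traces."""
    return [teacher(f"Caption: {c}\nReason step by step, then give the final answer.")
            for c in captions]

def sft(model: Model, data: List[str]) -> Model:
    """Stub for lightweight supervised fine-tuning on reasoning traces."""
    return model  # a real implementation would update weights on `data`

def rl_verifiable(model: Model) -> Model:
    """Stub for RL with a verifiable reward (e.g., exact-match on final answers)."""
    return model  # a real implementation would run PPO/GRPO-style training

def iterative_self_improvement(lvlm: Model, text_r1: Model,
                               captions: List[str], rounds: int = 3) -> Model:
    # Seed round: distill reasoning from the pure-text R1 model via captions,
    # since it cannot consume images directly.
    sft_data = distill_reasoning(text_r1, captions)
    for _ in range(rounds):
        lvlm = sft(lvlm, sft_data)    # SFT instills the reasoning format
        lvlm = rl_verifiable(lvlm)    # RL sharpens self-verification/correction
        # The RL-improved model regenerates a refined SFT set for the next round.
        sft_data = distill_reasoning(lvlm, captions)
    return lvlm
```

The key design choice, per the abstract, is the alternation: SFT transfers the reasoning format cheaply from text-only teachers, RL with verifiable rewards then improves generalization, and each round's stronger model bootstraps the next round's training data.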
