

OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

March 21, 2025
Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
cs.AI

Abstract

Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through reinforcement learning (RL) with verifiable rewards, significantly improving performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and RL to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhanced reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.
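
The abstract describes an alternating loop of lightweight SFT and RL, seeded by reasoning traces distilled from a text-only R1 model and refreshed each round with data generated by the RL-improved model. The sketch below is a minimal outline of that loop, not the authors' implementation; every function and parameter name (sft_step, rl_step, generate_sft_data, num_rounds) is a hypothetical placeholder, and the concrete recipe is in the linked repository.

```python
from typing import Callable, List

def iterative_self_improvement(
    model,                         # base LVLM to be improved
    initial_sft_data: List[dict],  # reasoning traces distilled from a text-only R1 model via image captions
    sft_step: Callable,            # supervised fine-tuning on a lightweight reasoning dataset (placeholder)
    rl_step: Callable,             # RL with verifiable rewards, e.g. answer matching (placeholder)
    generate_sft_data: Callable,   # improved model writes refined traces for the next round (placeholder)
    num_rounds: int = 3,           # number of SFT -> RL iterations (illustrative value)
):
    """Sketch of the iterative SFT + RL self-improvement loop described in the abstract."""
    sft_data = initial_sft_data
    for _ in range(num_rounds):
        model = sft_step(model, sft_data)    # 1) instill/refresh reasoning behavior via SFT
        model = rl_step(model)               # 2) strengthen reasoning with verifiable-reward RL
        sft_data = generate_sft_data(model)  # 3) self-generate refined SFT data for the next round
    return model                             # final iterate corresponds to OpenVLThinker
```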
