R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

March 7, 2025
Authors: Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
cs.AI

Abstract

Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable a large language model to develop complex reasoning autonomously, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often fail to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on a non-SFT 2B model. Starting from Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by about 2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero.
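
To make the "simple rule-based incentives" concrete, below is a minimal Python sketch of an R1-Zero-style reward: a format reward for emitting a `<think>...</think><answer>...</answer>` template plus an accuracy reward for matching the ground-truth answer. The tag template, weights, and function names are illustrative assumptions, not taken from the VisualThinker-R1-Zero codebase.

```python
import re

# Sketch of an R1-Zero-style rule-based reward (format + accuracy).
# The <think>/<answer> template and 1.0 weights are assumptions for
# illustration, not the project's actual implementation.
THINK_ANSWER = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the response follows the <think>/<answer> template."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer matches the ground truth
    (e.g. a multiple-choice letter for a spatial-reasoning question)."""
    match = THINK_ANSWER.search(completion)
    if match is None:
        return 0.0
    answer = match.group(1).strip().lower()
    return 1.0 if answer == ground_truth.strip().lower() else 0.0

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Total scalar reward used as the RL signal (no learned reward model)."""
    return format_reward(completion) + accuracy_reward(completion, ground_truth)

# Example: a well-formatted, correct response receives reward 2.0.
resp = "<think>The chair is closer to the camera than the table.</think><answer>B</answer>"
print(rule_based_reward(resp, "B"))  # 2.0
```

In an R1-Zero-style setup, this scalar is the only training signal fed to the policy-gradient update on the base model; the naive length reward the authors report as ineffective would simply add a term that grows with response length.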
