
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning

August 28, 2025
Authors: Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. The model then undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B on most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
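To make the second-stage idea more concrete, below is a minimal Python sketch, under stated assumptions, of what "forcing responses from both modes for each query" with GRPO-style group-normalized advantages could look like. It is not the authors' implementation; the names `sample_response`, `reward_fn`, and `Rollout` are illustrative placeholders, and the reward model and policy update are left abstract.

```python
# Minimal sketch (not the paper's code) of bi-mode rollouts for BPO-style training:
# for each query, the policy is forced to answer in both thinking and non-thinking
# modes, and GRPO-style group-normalized advantages are computed over the combined
# rollouts so the two modes compete on the same query.
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Rollout:
    mode: str          # "thinking" or "non-thinking"
    text: str          # generated response
    reward: float      # scalar reward from a verifier/reward function
    advantage: float = 0.0


def bi_mode_rollouts(
    query: str,
    sample_response: Callable[[str, str], str],  # (query, mode) -> response text
    reward_fn: Callable[[str, str], float],      # (query, response) -> scalar reward
    n_per_mode: int = 4,
) -> List[Rollout]:
    """Collect rollouts in both modes and assign group-normalized advantages."""
    rollouts: List[Rollout] = []
    for mode in ("thinking", "non-thinking"):
        for _ in range(n_per_mode):
            text = sample_response(query, mode)
            rollouts.append(Rollout(mode=mode, text=text, reward=reward_fn(query, text)))

    # GRPO-style advantage: normalize each reward by the statistics of the whole
    # group (both modes together), falling back to unit scale if rewards are equal.
    rewards = [r.reward for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    for r in rollouts:
        r.advantage = (r.reward - mu) / sigma
    return rollouts
```

In the actual method, these advantages would drive a clipped policy-gradient update as in GRPO; the sketch covers only the dual-mode sampling and group normalization step that lets the model learn when thinking actually pays off.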