

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning

August 28, 2025
作者: Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and to apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. The model then undergoes a second training phase under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B on most tasks and, on reasoning-intensive benchmarks, matches the performance of larger models such as Kimi-VL-A3B-Thinking-2506 (16B) at lower computational cost.
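To make the bi-mode rollout idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how a BPO-style training step might force the policy to sample responses under both a thinking and a non-thinking prompt for every query, then compute a GRPO-style group-relative advantage over the joint group. All names and mode tags (`bi_mode_rollout`, `<think>`, `<no_think>`, `reward_fn`) are illustrative assumptions, since the abstract does not specify these details.

```python
# Hypothetical sketch of a bi-mode (thinking / non-thinking) rollout feeding a
# GRPO-style update. Prefixes, group size, and reward handling are assumptions,
# not the R-4B implementation.
import numpy as np

THINK_PREFIX = "<think>"        # assumed tag that switches the model into thinking mode
NO_THINK_PREFIX = "<no_think>"  # assumed tag for direct (non-thinking) answering

def bi_mode_rollout(policy_generate, reward_fn, query, group_size=4):
    """Sample responses in BOTH modes for one query and compute
    group-relative (GRPO-style) advantages over the combined group."""
    samples = []
    for prefix in (THINK_PREFIX, NO_THINK_PREFIX):
        for _ in range(group_size):
            response = policy_generate(prefix + query)          # forced-mode generation
            samples.append((prefix, response, reward_fn(query, response)))

    rewards = np.array([r for _, _, r in samples], dtype=np.float32)
    # Normalize rewards within the joint thinking + non-thinking group, so the
    # two modes compete directly on the same query and the policy learns which
    # mode to activate.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return [(p, resp, adv) for (p, resp, _), adv in zip(samples, advantages)]
```

The abstract leaves open how advantages from the two modes are actually combined; normalizing over the joint group is just one plausible reading of "generate responses from both modes for each input query".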