
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

November 4, 2025
Authors: Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng
cs.AI

Abstract

We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual-reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves performance on reasoning and multimodal understanding benchmarks at both the 4B and 8B scales, achieves competitive results against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
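To make the dual-reward idea concrete, the sketch below reconstructs the reward shaping described in the abstract. It is a minimal illustration only: the placeholder scorers, the combination weights, and all function and field names are assumptions, since the paper's actual reward models and code are not yet public.

```python
"""Illustrative sketch of SAIL-RL's dual-reward shaping, reconstructed from the
abstract alone. Every scorer below is a placeholder heuristic standing in for
the paper's (unreleased) judge models; weights and names are assumptions."""

from dataclasses import dataclass


@dataclass
class Rollout:
    question: str
    reasoning: str            # chain-of-thought text; empty if the model answered directly
    answer: str
    used_deep_thinking: bool  # did the model emit a reasoning trace?


# --- Placeholder scorers (hypothetical stand-ins for learned judges) ---------

def score_factual_grounding(question: str, reasoning: str) -> float:
    """Placeholder: reward reasoning that references the question's content."""
    tokens = set(question.lower().split())
    hits = sum(1 for t in reasoning.lower().split() if t in tokens)
    return min(1.0, hits / max(1, len(tokens)))


def score_logical_coherence(reasoning: str) -> float:
    """Placeholder: treat multi-step traces as more coherent than one-liners."""
    steps = [s for s in reasoning.split(".") if s.strip()]
    return min(1.0, len(steps) / 3.0)


def score_answer_consistency(reasoning: str, answer: str) -> float:
    """Placeholder: the final answer should follow from (appear in) the trace."""
    return 1.0 if answer.lower() in reasoning.lower() else 0.0


# --- Dual rewards -------------------------------------------------------------

def thinking_reward(r: Rollout) -> float:
    """Average the three quality axes named in the abstract."""
    if not r.used_deep_thinking:
        return 0.0  # nothing to grade when the model answers directly
    return (
        score_factual_grounding(r.question, r.reasoning)
        + score_logical_coherence(r.reasoning)
        + score_answer_consistency(r.reasoning, r.answer)
    ) / 3.0


def judging_reward(r: Rollout, needs_deep_thinking: bool) -> float:
    """Reward choosing the right mode: deep reasoning on hard tasks,
    direct answering on easy ones."""
    return 1.0 if r.used_deep_thinking == needs_deep_thinking else 0.0


def total_reward(r: Rollout, correct: bool, needs_deep_thinking: bool,
                 w_think: float = 0.5, w_judge: float = 0.5) -> float:
    """Combine outcome correctness with the two shaping terms.
    The weights are illustrative, not taken from the paper."""
    return (float(correct)
            + w_think * thinking_reward(r)
            + w_judge * judging_reward(r, needs_deep_thinking))
```

The key design point the abstract emphasizes is that the outcome term alone (correct vs. incorrect) is insufficient: the Thinking Reward grades the trace itself, while the Judging Reward penalizes both overthinking on easy inputs and underthinking on hard ones.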