On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs
February 13, 2026
Authors: Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal
cs.AI
Abstract
Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they still suffer from weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, such as misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and the probability mass placed on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.
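As a rough illustration of the entropy-based probes described in the abstract, the sketch below compares a model's answer distribution over multiple-choice options before and after a textual perturbation (e.g., a misleading caption), reporting the change in entropy and in the probability mass on the correct option. The function names, inputs, and numbers are illustrative assumptions, not the paper's implementation.

    # Minimal sketch (assumed, not the paper's code) of an entropy-based
    # perturbation probe for a multiple-choice VQA setting.
    import math
    from typing import List

    def option_distribution(option_logits: List[float]) -> List[float]:
        """Softmax over per-option logits (e.g., log-probs of 'A', 'B', 'C', 'D')."""
        m = max(option_logits)
        exps = [math.exp(x - m) for x in option_logits]
        z = sum(exps)
        return [e / z for e in exps]

    def entropy(probs: List[float]) -> float:
        """Shannon entropy in nats; higher means more uncertainty over options."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    def perturbation_report(clean_logits, perturbed_logits, correct_idx):
        """Compare uncertainty and correct-option mass under clean vs. perturbed prompts."""
        p_clean = option_distribution(clean_logits)
        p_pert = option_distribution(perturbed_logits)
        return {
            "entropy_clean": entropy(p_clean),
            "entropy_perturbed": entropy(p_pert),
            "p_correct_clean": p_clean[correct_idx],
            "p_correct_perturbed": p_pert[correct_idx],
        }

    # Hypothetical example: a misleading caption shifts mass away from the
    # correct option (index 2) and changes the model's uncertainty.
    print(perturbation_report([1.2, 0.3, 2.5, -0.4], [2.1, 0.5, 1.0, -0.2], correct_idx=2))

In practice such per-option logits would come from scoring the candidate answer tokens with the VLM under each prompt variant; the sketch only shows how the entropy and correct-option probability comparison would be computed once those scores are available.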