Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
January 11, 2026
Authors: Jie Zhu, Yiyang Su, Xiaoming Liu
cs.AI
Abstract
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open the question of why CoT degrades performance on perception-heavy tasks. We systematically re-examine the role of CoT in FGVC through zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by reasoning length, with longer textual reasoning consistently lowering classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) \alg, a simple, general, plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the generality of our findings and the effectiveness of ReFine-RFT, which achieves state-of-the-art performance across FGVC benchmarks. Code and models are available at https://github.com/jiezhu23/ReFine-RFT.
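The abstract does not spell out the exact formulation of \alg or the ensemble reward, so the following is only a minimal sketch of the general idea it describes, not the paper's actual method: each reward signal is normalized independently across a group of sampled rollouts so that heterogeneous signals mix on a comparable scale, and a length penalty discourages long textual reasoning. All names here (normalize_rewards, combined_reward, the weights, and the max_len cap) are illustrative assumptions.

```python
import numpy as np

def normalize_rewards(r: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Z-score one reward signal across a group of rollouts so that
    # heterogeneous signals share a comparable scale before mixing.
    # (Hypothetical sketch; not the paper's \alg formulation.)
    return (r - r.mean()) / (r.std() + eps)

def combined_reward(accuracy: np.ndarray,
                    format_ok: np.ndarray,
                    lengths: np.ndarray,
                    max_len: int = 256,
                    weights: tuple = (1.0, 0.2, 0.2)) -> np.ndarray:
    # Blend an ensemble of signals: dense accuracy feedback, a format
    # reward, and a penalty on reasoning length (shorter is better,
    # in line with the "Cost of Thinking" observation).
    length_penalty = -np.clip(lengths / max_len, 0.0, 1.0)
    signals = (accuracy, format_ok, length_penalty)
    return sum(w * normalize_rewards(s) for w, s in zip(weights, signals))

# Toy example: four sampled responses for one FGVC prompt.
acc = np.array([1.0, 0.0, 1.0, 0.0])        # 1 if the predicted class is correct
fmt = np.array([1.0, 1.0, 0.0, 1.0])        # 1 if the answer follows the expected format
tok = np.array([40.0, 210.0, 95.0, 300.0])  # reasoning length in tokens
print(combined_reward(acc, fmt, tok))
```

Normalizing each signal before weighting keeps any single reward (for example, a raw token count) from dominating the optimization purely because of its scale, which is the balancing role the abstract attributes to \alg.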