Directional Reasoning Injection for Fine-Tuning MLLMs
October 16, 2025
Authors: Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
cs.AI
Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their
reasoning ability often lags behind that of strong text-only counterparts.
Existing methods to bridge this gap rely on supervised fine-tuning over
large-scale multimodal reasoning data or reinforcement learning, both of which
are resource-intensive. A promising alternative is model merging, which
interpolates parameters between reasoning-enhanced LLMs and multimodal
variants. However, our analysis shows that naive merging is not always a "free
lunch": its effectiveness varies drastically across model families, with some
(e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance
degradation. To address this, we propose Directional Reasoning Injection for
Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning
knowledge in the gradient space, without destabilizing multimodal alignment.
DRIFT precomputes a reasoning prior as the parameter-space difference between
reasoning and multimodal variants, then uses it to bias gradients during
multimodal fine-tuning. This approach preserves the simplicity of standard
supervised fine-tuning pipelines while enabling efficient reasoning transfer.
Extensive experiments on multimodal reasoning benchmarks, including MathVista
and MathVerse, demonstrate that DRIFT consistently improves reasoning
performance over naive merging and supervised fine-tuning, while matching or
surpassing training-heavy methods at a fraction of the cost.
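To make the two operations in the abstract concrete, here is a minimal PyTorch sketch of (a) naive merging as linear interpolation between checkpoints and (b) DRIFT-style fine-tuning with a precomputed reasoning prior. The function names, the interpolation weight alpha, and the biasing rule (a simple additive nudge of strength lam) are illustrative assumptions; the abstract only states that the prior is the parameter-space difference between the two variants and that it is used to bias gradients.

    import torch

    def compute_reasoning_prior(reasoning_sd, multimodal_sd):
        # Reasoning prior: parameter-space difference between the
        # reasoning-enhanced LLM and its multimodal variant (shared keys only).
        return {k: reasoning_sd[k] - multimodal_sd[k]
                for k in multimodal_sd if k in reasoning_sd}

    def naive_merge(reasoning_sd, multimodal_sd, alpha=0.5):
        # Naive merging: linear interpolation between the two checkpoints.
        # Per the abstract, this helps some families (LLaVA, Idefics)
        # but degrades others (Qwen).
        return {k: (1 - alpha) * multimodal_sd[k] + alpha * reasoning_sd[k]
                for k in multimodal_sd if k in reasoning_sd}

    def drift_step(model, optimizer, loss, prior, lam=0.1):
        # One supervised fine-tuning step with gradients biased toward
        # the reasoning prior. The additive rule g <- g + lam * prior is
        # an assumed instantiation of "using the prior to bias gradients";
        # the paper's exact rule is not given in the abstract.
        optimizer.zero_grad()
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None and name in prior:
                p.grad.add_(prior[name].to(p.grad), alpha=lam)
        optimizer.step()

Unlike naive merging, which overwrites parameters once, the prior here only steers optimization while the model continues to fit multimodal data, which is consistent with the abstract's claim that DRIFT transfers reasoning knowledge without destabilizing multimodal alignment.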