Directional Reasoning Injection for Fine-Tuning MLLMs
October 16, 2025
Authors: Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
cs.AI
Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their
reasoning ability often lags behind that of strong text-only counterparts.
Existing methods to bridge this gap rely on supervised fine-tuning over
large-scale multimodal reasoning data or reinforcement learning, both of which
are resource-intensive. A promising alternative is model merging, which
interpolates parameters between reasoning-enhanced LLMs and multimodal
variants. However, our analysis shows that naive merging is not always a "free
lunch": its effectiveness varies drastically across model families, with some
(e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance
degradation. To address this, we propose Directional Reasoning Injection for
Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning
knowledge in the gradient space, without destabilizing multimodal alignment.
DRIFT precomputes a reasoning prior as the parameter-space difference between
reasoning and multimodal variants, then uses it to bias gradients during
multimodal fine-tuning. This approach preserves the simplicity of standard
supervised fine-tuning pipelines while enabling efficient reasoning transfer.
Extensive experiments on multimodal reasoning benchmarks, including MathVista
and MathVerse, demonstrate that DRIFT consistently improves reasoning
performance over naive merging and supervised fine-tuning, while matching or
surpassing training-heavy methods at a fraction of the cost.
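The core idea described above, precomputing a reasoning prior as a parameter-space difference and using it to bias gradients during fine-tuning, can be sketched minimally as follows. This is a toy illustration only: the abstract does not give DRIFT's exact update rule, so the function names, the mixing coefficient `lam`, and the additive sign convention (subtracting the prior from the gradient so the SGD step moves toward the reasoning variant) are all assumptions.

```python
import numpy as np

def reasoning_prior(theta_reasoning, theta_multimodal):
    """Reasoning prior: parameter-space difference between the
    reasoning-enhanced and multimodal variants (per the abstract)."""
    return {k: theta_reasoning[k] - theta_multimodal[k]
            for k in theta_multimodal}

def drift_step(theta, grads, prior, lr=0.1, lam=0.5):
    """One SGD step with the gradient biased by the reasoning prior.
    `lam` (assumed) controls how strongly the update is pulled toward
    the reasoning variant's parameters."""
    new_theta = {}
    for k, g in grads.items():
        biased_grad = g - lam * prior[k]  # bias toward reasoning direction
        new_theta[k] = theta[k] - lr * biased_grad
    return new_theta

# Toy example with a single scalar "layer".
theta_mm = {"w": np.array([0.0])}   # multimodal variant
theta_rs = {"w": np.array([1.0])}   # reasoning-enhanced variant
prior = reasoning_prior(theta_rs, theta_mm)
theta = drift_step(theta_mm, {"w": np.array([0.2])}, prior)
print(theta["w"])  # [0.03]: step drifts toward the reasoning weights
```

Because the prior is computed once before training, each step costs no more than a standard SGD update, which is consistent with the abstract's claim that DRIFT preserves the simplicity of ordinary supervised fine-tuning pipelines.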