Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
March 18, 2026
作者: Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu
cs.AI
Abstract
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to the temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
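To make the instruction-aware allocation idea concrete, below is a minimal Python sketch of a hybrid-frame policy. It is not the authors' implementation: the keyword heuristic, the `allocate_frames` function, and all frame budgets are illustrative assumptions standing in for whatever instruction classifier the paper actually uses.

```python
# A minimal sketch of an instruction-aware hybrid-frame policy.
# Assumption: a simple keyword heuristic stands in for the paper's
# instruction classifier; all names and budgets here are hypothetical.

# Illustrative cue words suggesting an instruction needs temporal reasoning.
TEMPORAL_CUES = {"before", "after", "then", "order", "sequence",
                 "how long", "first", "next", "happens", "action"}

def allocate_frames(instruction: str,
                    max_frames: int = 32,
                    min_frames: int = 4) -> int:
    """Return a per-sample frame budget based on the instruction text.

    Hypothetical policy: temporally phrased questions receive the full
    frame budget, while spatially phrased ones (e.g. about a single
    scene or object appearance) receive a small budget, freeing the
    token budget for spatial detail.
    """
    text = instruction.lower()
    needs_temporal = any(cue in text for cue in TEMPORAL_CUES)
    return max_frames if needs_temporal else min_frames

# Usage: route each training sample through the allocator.
if __name__ == "__main__":
    print(allocate_frames("What happens after the dog jumps?"))   # 32
    print(allocate_frames("What color is the car on the left?"))  # 4
```

In a real training pipeline the heuristic would likely be replaced by a learned or LLM-based classifier, but the control flow, deciding the frame count per instruction before sampling, is the essence of the hybrid-frame strategy described above.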