Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
March 18, 2026
作者: Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu
cs.AI
Abstract
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to the temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
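To make the instruction-aware allocation idea concrete, below is a minimal Python sketch of a hybrid-frame policy. It is not the authors' implementation: the keyword heuristic, the `allocate_frames` function, and all frame budgets are illustrative assumptions standing in for whatever instruction classifier the paper actually uses.

```python
# A minimal sketch of an instruction-aware hybrid-frame policy.
# Assumption: a simple keyword heuristic stands in for the paper's
# instruction classifier; all names and budgets here are hypothetical.

# Illustrative cue words suggesting an instruction needs temporal reasoning.
TEMPORAL_CUES = {"before", "after", "then", "order", "sequence",
                 "how long", "first", "next", "happens", "action"}

def allocate_frames(instruction: str,
                    max_frames: int = 32,
                    min_frames: int = 4) -> int:
    """Return a per-sample frame budget based on the instruction text.

    Hypothetical policy: temporally phrased questions receive the full
    frame budget, while spatially phrased ones (e.g. about a single
    scene or object appearance) receive a small budget, freeing the
    token budget for spatial detail.
    """
    text = instruction.lower()
    needs_temporal = any(cue in text for cue in TEMPORAL_CUES)
    return max_frames if needs_temporal else min_frames

# Usage: route each training sample through the allocator.
if __name__ == "__main__":
    print(allocate_frames("What happens after the dog jumps?"))   # 32
    print(allocate_frames("What color is the car on the left?"))  # 4
```

In a real training pipeline the heuristic would likely be replaced by a learned or LLM-based classifier, but the control flow, deciding the frame count per instruction before sampling, is the essence of the hybrid-frame strategy described above.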