Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to the temporal budget (the number of sampled frames): increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and that preserving spatial understanding remains a central challenge in joint image-video training.
