Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to the temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and that preserving spatial understanding remains a central challenge in joint image-video training.
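To make the notion of a temporal budget concrete, the following is a minimal sketch of uniform frame sampling under a fixed frame count, paired with a toy instruction-aware budget choice. It is not the Hybrid-Frame strategy studied in the paper, whose details are not given in the abstract. The function names, the keyword list, and the budget values 1 and 16 are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical cue words suggesting a question needs temporal context.
# This list is an illustrative assumption, not taken from the paper.
TEMPORAL_CUES = {"before", "after", "then", "during", "sequence", "order", "motion"}


def uniform_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Pick up to `budget` frame indices spread evenly across a clip."""
    if total_frames <= 0 or budget <= 0:
        return []
    n = min(budget, total_frames)
    # Split the clip into n equal segments and take each segment's midpoint.
    return [(2 * i + 1) * total_frames // (2 * n) for i in range(n)]


def choose_budget(instruction: str, low: int = 1, high: int = 16) -> int:
    """Toy heuristic: spend more frames when the instruction sounds temporal.

    The `low` and `high` defaults are assumed values for illustration only.
    """
    words = set(re.findall(r"[a-z]+", instruction.lower()))
    return high if words & TEMPORAL_CUES else low


if __name__ == "__main__":
    for question in ("What color is the car?", "What happens after the ball drops?"):
        budget = choose_budget(question)
        print(question, "->", uniform_frame_indices(100, budget))
```

A spatial question here receives a single frame, while a temporal one receives the larger budget. A real allocator would presumably be learned or tuned rather than keyword-based.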
