Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Abstract

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, has been treated only in fragments across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
