ビデオ-LMM ポストトレーニング: 大規模マルチモーダルモデルを用いたビデオ推論の詳細分析

要旨

ビデオ理解は、コンピュータビジョンにおいて最も挑戦的なフロンティアであり、複雑な時空間的関係、長期的な依存関係、およびマルチモーダルな証拠についてモデルが推論することを要求する。最近登場したビデオ大規模マルチモーダルモデル（Video-LMMs）は、視覚エンコーダを強力なデコーダベースの言語モデルと統合し、ビデオ理解タスクにおいて顕著な能力を示している。しかし、これらのモデルを基本的な知覚システムから洗練された推論エンジンへと変革する重要な段階であるポストトレーニングは、文献全体で断片的にしか扱われていない。本調査は、Video-LMMsのポストトレーニング手法を初めて包括的に検証し、チェーン・オブ・ソートを用いた教師あり微調整（SFT）、検証可能な目的からの強化学習（RL）、および強化された推論計算によるテストタイムスケーリング（TTS）という3つの基本柱を網羅する。これらの技術の役割、相互接続、およびビデオ特有の適応を明確にする構造化された分類法を提示し、時間的ローカライゼーション、時空間的グラウンディング、長いビデオの効率性、マルチモーダル証拠の統合といった独自の課題に対処する。代表的な手法の系統的な分析を通じて、主要な設計原則、洞察、および評価プロトコルを統合し、報酬設計、スケーラビリティ、コストパフォーマンス最適化における重要な未解決の課題を特定する。さらに、ポストトレーニングの効果を厳密に評価するための重要なベンチマーク、データセット、およびメトリクスをキュレーションする。本調査は、研究者や実務者にVideo-LMMの能力を進展させるための統一されたフレームワークを提供することを目的としている。追加リソースと更新情報は以下で維持されている：https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

English

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

ビデオ-LMM ポストトレーニング: 大規模マルチモーダルモデルを用いたビデオ推論の詳細分析

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

要旨

Support