Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

October 6, 2025
作者: Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
cs.AI

Abstract

Video understanding represents one of the most challenging frontiers in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols, while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
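To make the "RL from verifiable objectives" pillar concrete, below is a minimal sketch of a rule-based reward for temporal localization: the reward combines the temporal IoU between a predicted time span and the annotated span with a small format bonus. This is an illustrative assumption about how such rewards are commonly built, not the survey's prescribed recipe; the `<think>/<answer>` tag convention, the weights, and the function names are all hypothetical.

```python
# Illustrative sketch of a verifiable (rule-based) reward for temporal grounding.
# Tag format, weights, and names are assumptions for illustration only.
import re


def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def verifiable_reward(response: str, gt_span: tuple[float, float]) -> float:
    """Reward = weighted temporal IoU + format bonus; no learned reward model."""
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))
    answer = re.search(r"<answer>\s*([\d.]+)\s*-\s*([\d.]+)\s*</answer>", response)
    if answer is None:
        return 0.0
    pred_span = (float(answer.group(1)), float(answer.group(2)))
    return 0.9 * temporal_iou(pred_span, gt_span) + 0.1 * float(format_ok)


# Example: a response localizing an event at 12.0-18.5 s, ground truth 11.0-19.0 s.
resp = "<think>The action starts right after the cut.</think><answer>12.0 - 18.5</answer>"
print(round(verifiable_reward(resp, (11.0, 19.0)), 3))  # 0.9 * 0.8125 + 0.1 = 0.831
```

In the same spirit, verifiable objectives for other video tasks (e.g., spatiotemporal grounding or multiple-choice QA) replace the IoU term with box-IoU or exact-match checks, which is what distinguishes this family of RL methods from reward-model-based preference optimization.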