Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
October 6, 2025
Authors: Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
cs.AI
Abstract
Video understanding represents the most challenging frontier in computer
vision, requiring models to reason about complex spatiotemporal relationships,
long-term dependencies, and multimodal evidence. The recent emergence of
Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders
with powerful decoder-based language models, has demonstrated remarkable
capabilities in video understanding tasks. However, the critical phase that
transforms these models from basic perception systems into sophisticated
reasoning engines, post-training, remains fragmented across the literature.
This survey provides the first comprehensive examination of post-training
methodologies for Video-LMMs, encompassing three fundamental pillars:
supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL)
from verifiable objectives, and test-time scaling (TTS) through enhanced
inference computation. We present a structured taxonomy that clarifies the
roles, interconnections, and video-specific adaptations of these techniques,
addressing unique challenges such as temporal localization, spatiotemporal
grounding, long video efficiency, and multimodal evidence integration. Through
systematic analysis of representative methods, we synthesize key design
principles, insights, and evaluation protocols while identifying critical open
challenges in reward design, scalability, and cost-performance optimization. We
further curate essential benchmarks, datasets, and metrics to facilitate
rigorous assessment of post-training effectiveness. This survey aims to provide
researchers and practitioners with a unified framework for advancing Video-LMM
capabilities. Additional resources and updates are maintained at:
https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
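Since the abstract names reinforcement learning from verifiable objectives as one of the survey's three pillars, the minimal Python sketch below illustrates what such a verifiable reward can look like for a temporal-grounding task. The tag format, the temporal-IoU accuracy term, and the weights are illustrative assumptions for this sketch, not the specific reward design of any method covered in the survey.

```python
# Illustrative sketch (not from the survey): a verifiable reward of the kind used in
# RL post-training for Video-LMMs, combining a format check on the model's
# chain-of-thought output with a temporal-IoU reward for grounding accuracy.
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the expected tags, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def temporal_iou_reward(pred_span: tuple[float, float],
                        gt_span: tuple[float, float]) -> float:
    """Temporal IoU between predicted and ground-truth segments (start, end) in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(response: str,
                      pred_span: tuple[float, float],
                      gt_span: tuple[float, float],
                      w_format: float = 0.2,
                      w_accuracy: float = 0.8) -> float:
    """Weighted sum of the format reward and the temporal-grounding accuracy reward."""
    return (w_format * format_reward(response)
            + w_accuracy * temporal_iou_reward(pred_span, gt_span))

# Example: a well-formatted response whose predicted segment overlaps the ground truth.
resp = "<think>The action starts after the person enters.</think><answer>12.0-18.0</answer>"
print(verifiable_reward(resp, pred_span=(12.0, 18.0), gt_span=(10.0, 20.0)))  # 0.2 + 0.8*0.6 = 0.68
```

Rewards of this shape are checkable directly against annotations, which is what makes them usable as "verifiable objectives" for policy-optimization methods during post-training and as scoring functions when ranking candidates under test-time scaling.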