Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
October 6, 2025
Authors: Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
cs.AI
Abstract
Video understanding represents the most challenging frontier in computer
vision, requiring models to reason about complex spatiotemporal relationships,
long-term dependencies, and multimodal evidence. The recent emergence of
Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders
with powerful decoder-based language models, has demonstrated remarkable
capabilities in video understanding tasks. However, the critical phase that
transforms these models from basic perception systems into sophisticated
reasoning engines, post-training, remains fragmented across the literature.
This survey provides the first comprehensive examination of post-training
methodologies for Video-LMMs, encompassing three fundamental pillars:
supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL)
from verifiable objectives, and test-time scaling (TTS) through enhanced
inference computation. We present a structured taxonomy that clarifies the
roles, interconnections, and video-specific adaptations of these techniques,
addressing unique challenges such as temporal localization, spatiotemporal
grounding, long video efficiency, and multimodal evidence integration. Through
systematic analysis of representative methods, we synthesize key design
principles, insights, and evaluation protocols while identifying critical open
challenges in reward design, scalability, and cost-performance optimization. We
further curate essential benchmarks, datasets, and metrics to facilitate
rigorous assessment of post-training effectiveness. This survey aims to provide
researchers and practitioners with a unified framework for advancing Video-LMM
capabilities. Additional resources and updates are maintained at:
https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
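Since the abstract names reinforcement learning from verifiable objectives as one of the survey's three pillars, the minimal Python sketch below illustrates what such a verifiable reward can look like for a temporal-grounding task. The tag format, the temporal-IoU accuracy term, and the weights are illustrative assumptions for this sketch, not the specific reward design of any method covered in the survey.

```python
# Illustrative sketch (not from the survey): a verifiable reward of the kind used in
# RL post-training for Video-LMMs, combining a format check on the model's
# chain-of-thought output with a temporal-IoU reward for grounding accuracy.
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the expected tags, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def temporal_iou_reward(pred_span: tuple[float, float],
                        gt_span: tuple[float, float]) -> float:
    """Temporal IoU between predicted and ground-truth segments (start, end) in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(response: str,
                      pred_span: tuple[float, float],
                      gt_span: tuple[float, float],
                      w_format: float = 0.2,
                      w_accuracy: float = 0.8) -> float:
    """Weighted sum of the format reward and the temporal-grounding accuracy reward."""
    return (w_format * format_reward(response)
            + w_accuracy * temporal_iou_reward(pred_span, gt_span))

# Example: a well-formatted response whose predicted segment overlaps the ground truth.
resp = "<think>The action starts after the person enters.</think><answer>12.0-18.0</answer>"
print(verifiable_reward(resp, pred_span=(12.0, 18.0), gt_span=(10.0, 20.0)))  # 0.2 + 0.8*0.6 = 0.68
```

Rewards of this shape are checkable directly against annotations, which is what makes them usable as "verifiable objectives" for policy-optimization methods during post-training and as scoring functions when ranking candidates under test-time scaling.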