Rethinking Chain-of-Thought Reasoning for Videos
December 10, 2025
Authors: Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
cs.AI
Abstract
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
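The abstract describes two efficiency levers: operating on compressed visual tokens and constraining the model to a brief reasoning trace before answering. The sketch below illustrates that inference flow in miniature. All names (`compress_tokens`, `build_prompt`, the subsampling strategy, and the token budget) are illustrative assumptions, not the paper's method; the repository linked above is authoritative.

```python
# Hypothetical sketch of the inference flow the abstract describes:
# (1) compress per-frame visual tokens, (2) prompt the model to produce
# a short reasoning trace before the final answer. Uniform subsampling
# here is a stand-in for whatever compression the framework actually uses.

def compress_tokens(frame_tokens, keep_ratio=0.25):
    """Keep roughly `keep_ratio` of the tokens in each frame
    via uniform subsampling (illustrative placeholder)."""
    step = max(1, round(1 / keep_ratio))
    return [tokens[::step] for tokens in frame_tokens]

def build_prompt(question, num_visual_tokens, max_reasoning_tokens=64):
    """Instruct the model to reason concisely before answering
    (a generic concise-CoT prompt, not the paper's exact template)."""
    return (
        f"[{num_visual_tokens} visual tokens]\n"
        f"Question: {question}\n"
        f"Think briefly in at most {max_reasoning_tokens} tokens, "
        f"then state the final answer."
    )

# Toy input: 8 frames, 16 placeholder tokens each.
frames = [[f"f{i}_t{j}" for j in range(16)] for i in range(8)]
compressed = compress_tokens(frames, keep_ratio=0.25)  # 4 tokens per frame
total = sum(len(f) for f in compressed)                # 32 tokens overall
prompt = build_prompt("What happens after the person opens the door?", total)
```

With a 0.25 keep ratio, the 128 toy tokens shrink to 32, and the prompt caps the reasoning trace at a fixed budget, mirroring the abstract's claim that reduced visual input plus concise reasoning can suffice.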