Rethinking Chain-of-Thought Reasoning for Videos
December 10, 2025
Authors: Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
cs.AI
Abstract
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically rely on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
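To make the two ingredients named in the abstract concrete, here is a minimal, hypothetical sketch: pooling-based compression of visual tokens plus a prompt that asks for a brief reasoning trace before the answer. This is not the authors' released implementation; `compress_visual_tokens`, `keep_ratio`, and `CONCISE_COT_PROMPT` are illustrative assumptions standing in for whatever compression and prompting scheme the paper actually uses.

```python
# Minimal sketch (assumed, not the paper's code) of the two ideas from the
# abstract: (1) operate on a compressed set of visual tokens, and
# (2) elicit a brief reasoning trace before the final answer.
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Shrink a (num_tokens, dim) visual token sequence by 1D average pooling.

    A real system might use learned token merging or pruning instead;
    pooling is only the simplest stand-in for "compressed visual tokens".
    """
    num_tokens, dim = tokens.shape
    target = max(1, int(num_tokens * keep_ratio))
    # Pool along the token axis: (1, dim, num_tokens) -> (1, dim, target).
    pooled = F.adaptive_avg_pool1d(tokens.t().unsqueeze(0), target)
    return pooled.squeeze(0).t()  # (target, dim)

# Hypothetical prompt nudging the model toward concise reasoning, in the
# spirit of "brief reasoning traces prior to answering".
CONCISE_COT_PROMPT = (
    "Answer the question about the video. "
    "First give a brief (1-3 sentence) reasoning trace, then state the final answer."
)

if __name__ == "__main__":
    video_tokens = torch.randn(4096, 1024)  # e.g., frame tokens from a vision encoder
    compressed = compress_visual_tokens(video_tokens, keep_ratio=0.25)
    print(compressed.shape)  # torch.Size([1024, 1024]): 4x fewer tokens into the LLM
```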