

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

January 8, 2026
作者: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
cs.AI

Abstract

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
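The abstract describes a confidence-gated, reason-when-necessary inference procedure: answer directly first, and only invoke explicit chain-of-thought reasoning when the initial answer's confidence is low. Below is a minimal Python sketch of that control flow. All names (AutoReasoner, answer_directly, answer_with_reasoning, confidence_threshold) and the threshold value are illustrative assumptions, not the authors' implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class AutoReasoner:
    """Confidence-gated inference sketch: direct answer first, reason only if needed."""
    # Hypothetical decoders supplied by the caller (in practice, the video LLM):
    #   answer_directly      -> (answer, confidence), e.g. confidence = mean token probability
    #   answer_with_reasoning -> reviewed answer produced after an explicit reasoning trace
    answer_directly: Callable[[str], Tuple[str, float]]
    answer_with_reasoning: Callable[[str], str]
    confidence_threshold: float = 0.8  # assumed gate value; not specified in the abstract

    def __call__(self, prompt: str) -> str:
        # Step 1: cheap direct answer plus its confidence score.
        answer, confidence = self.answer_directly(prompt)
        # Step 2: if the model is already confident, skip reasoning and return early.
        if confidence >= self.confidence_threshold:
            return answer
        # Step 3: otherwise, think step by step and return the reviewed answer.
        return self.answer_with_reasoning(prompt)


# Toy usage with stub decoders (a real system would query the video model here).
if __name__ == "__main__":
    reasoner = AutoReasoner(
        answer_directly=lambda q: ("blue", 0.95),
        answer_with_reasoning=lambda q: "blue (after reviewing the frames)",
    )
    print(reasoner("What color is the car in the video?"))  # high confidence -> "blue"
```

During training, per the abstract, both the initial and the reviewed answer would be scored with a verifiable reward (e.g., matching the ground-truth answer), so the direct-answer path is optimized rather than treated as a shortcut; how the confidence score is computed is not specified in this abstract.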