VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
January 8, 2026
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
cs.AI
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
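The confidence-gated inference described above can be sketched as follows. This is a minimal illustration, not the paper's released implementation: the model calls (`decode_initial_answer`, `decode_with_reasoning`), the confidence proxy (geometric-mean token probability), and the threshold value are all illustrative assumptions.

```python
import math

def decode_initial_answer(question):
    # Hypothetical stand-in for the fast first decoding pass: returns a
    # direct answer plus per-token log-probabilities. A real system would
    # call the RL-trained video model here; this dummy always answers "B".
    return "B", [-0.05, -0.10]

def decode_with_reasoning(question):
    # Hypothetical stand-in for the slower "thinking" pass that produces
    # an explicit chain of thought followed by a reviewed answer.
    return "B"

def answer_confidence(token_logprobs):
    # Geometric-mean token probability as a simple confidence proxy
    # (one plausible choice; the paper's exact score may differ).
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def video_auto_answer(question, threshold=0.9):
    """Reason-when-necessary inference: skip chain-of-thought when the
    initial answer is already confident enough."""
    answer, logprobs = decode_initial_answer(question)
    if answer_confidence(logprobs) >= threshold:
        return answer, False          # thinking mode not activated
    return decode_with_reasoning(question), True

ans, used_thinking = video_auto_answer("Which object moved first?")
```

With the dummy log-probabilities above, the confidence is exp(-0.075) ≈ 0.93, so the gate returns the initial answer without invoking the reasoning pass; a harder question with lower token probabilities would fall below the threshold and trigger it.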