Video Reasoning without Training
October 19, 2025
Authors: Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague
cs.AI
Abstract
Video reasoning using Large Multimodal Models (LMMs) relies on costly
reinforcement learning (RL) and verbose chain-of-thought, resulting in
substantial computational overhead during both training and inference.
Moreover, the mechanisms that control the thinking process in these reasoning
models are very limited. In this paper, using the entropy of the model's output
as a signal, we discover that high-quality models go through a series of
micro-explorations and micro-exploitations which keep the reasoning process
grounded (i.e., avoid excessive randomness while the model is exploring or
thinking through an answer). We further observe that once this "thinking"
process is over, more accurate models demonstrate a better convergence by
reducing the entropy significantly via a final exploitation phase (i.e., a more
certain convergence towards a solution trajectory). We then use these novel,
theoretically-grounded insights to tune the model's behavior directly at
inference, without using any RL or supervised fine-tuning. Specifically, during
inference, our proposed approach called V-Reason (Video-Reason) adapts the
value cache of the LMM via a few optimization steps on a small, trainable
controller using an entropy-based objective, i.e., no supervision from any
dataset or RL is necessary. This tuning improves the model's micro-exploration
and exploitation behavior during inference. Our experiments show that our
proposed method achieves significant improvements over the base
instruction-tuned models across several video reasoning datasets, narrowing the
gap with RL-trained models to within 0.6% average accuracy without any
training, while offering massive efficiency benefits: output tokens are reduced
by 58.6% compared to the RL model.
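
To make the core mechanism concrete, below is a minimal, self-contained PyTorch sketch of the general idea described in the abstract: a frozen model's cached values are perturbed by a small trainable controller that is optimized for a few steps at inference time against an entropy-based objective, with no labels or RL. The toy readout, attention weights, low-rank controller, and target entropy are all illustrative assumptions, not the paper's actual V-Reason implementation.

```python
# Toy sketch (PyTorch) of inference-time, entropy-driven adaptation of a
# cached "value" tensor through a small trainable controller. All shapes,
# the linear readout, and the target entropy are illustrative assumptions;
# this is NOT the authors' V-Reason implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab, cache_len = 64, 1000, 32

# Stand-ins for a frozen LMM: a value cache from prior decoding steps and a
# frozen readout mapping an attention-pooled value to next-token logits.
value_cache = torch.randn(cache_len, d_model)             # frozen cached values
attn_weights = torch.softmax(torch.randn(cache_len), 0)   # frozen attention pattern
readout = torch.nn.Linear(d_model, vocab)
for p in readout.parameters():
    p.requires_grad_(False)

# Small trainable controller: a low-rank additive correction to the value cache.
rank = 4
controller = torch.nn.Sequential(
    torch.nn.Linear(d_model, rank, bias=False),
    torch.nn.Linear(rank, d_model, bias=False),
)
opt = torch.optim.Adam(controller.parameters(), lr=1e-2)

def next_token_entropy(values: torch.Tensor) -> torch.Tensor:
    """Entropy (in nats) of the next-token distribution induced by `values`."""
    pooled = attn_weights @ values                 # attention-pooled context
    logp = F.log_softmax(readout(pooled), dim=-1)
    return -(logp.exp() * logp).sum()

target_entropy = 1.0  # illustrative target encouraging a confident exploitation phase

# A few optimization steps at inference time: only the controller is updated;
# the base model (readout) and the original cache stay frozen.
for step in range(8):
    adapted_values = value_cache + controller(value_cache)
    loss = (next_token_entropy(adapted_values) - target_entropy).pow(2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        print(f"step {step}: entropy={next_token_entropy(value_cache + controller(value_cache)).item():.3f}")
```

In a real decoder the same idea would apply per attention layer to the V entries of the KV cache, with the controller discarded or re-initialized between queries; the point of the sketch is only that an unsupervised entropy objective can drive a handful of gradient steps on a tiny module while the LMM itself stays frozen.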