Video Reasoning without Training
October 19, 2025
Authors: Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague
cs.AI
Abstract
Video reasoning using Large Multimodal Models (LMMs) relies on costly
reinforcement learning (RL) and verbose chain-of-thought, resulting in
substantial computational overhead during both training and inference.
Moreover, the mechanisms that control the thinking process in these reasoning
models are very limited. In this paper, using the entropy of the model's output as
a signal, we discover that high-quality models go through a series of
micro-explorations and micro-exploitations that keep the reasoning process
grounded (i.e., they avoid excessive randomness while the model is exploring or
thinking through an answer). We further observe that once this "thinking"
process is over, more accurate models demonstrate better convergence by
significantly reducing entropy in a final exploitation phase (i.e., a more
certain convergence onto a solution trajectory). We then use these novel,
theoretically grounded insights to tune the model's behavior directly at
inference, without using any RL or supervised fine-tuning. Specifically, during
inference, our proposed approach, V-Reason (Video-Reason), adapts the
value cache of the LMM via a few optimization steps on a small, trainable
controller using an entropy-based objective; no supervision from any
dataset or RL is necessary. This tuning improves the model's micro-exploration
and exploitation behavior during inference. Our experiments show that our
proposed method achieves significant improvements over the base
instruction-tuned models across several video reasoning datasets, narrowing the
gap with RL-trained models to within 0.6% average accuracy without any
training, while offering massive efficiency benefits: output tokens are reduced
by 58.6% compared to the RL model.
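
To make the mechanism concrete, below is a minimal, hypothetical sketch (in PyTorch) of inference-time value-cache adaptation driven by an entropy objective, in the spirit of the description above. The controller design (a per-head affine modulation of the cached values), the names `ValueCacheController` and `adapt_value_cache`, the number of optimization steps, the learning rate, and the toy stand-in for the frozen LMM are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: adapt an LMM's value cache at inference time by running
# a few optimization steps on a small controller with an entropy objective.
# All module names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F


class ValueCacheController(torch.nn.Module):
    """Small trainable controller: per-head affine modulation of the value cache."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(num_heads, 1, head_dim))
        self.shift = torch.nn.Parameter(torch.zeros(num_heads, 1, head_dim))

    def forward(self, value_cache: torch.Tensor) -> torch.Tensor:
        # value_cache: (batch, num_heads, seq_len, head_dim)
        return value_cache * self.scale + self.shift


def entropy_of_logits(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()


def adapt_value_cache(value_cache, logits_fn, num_steps=3, lr=1e-2):
    """Run a few entropy-driven optimization steps on the controller only.

    logits_fn(adapted_cache) must re-run the frozen model's readout with the
    modulated value cache and return next-token logits.
    """
    controller = ValueCacheController(value_cache.shape[1], value_cache.shape[-1])
    optimizer = torch.optim.Adam(controller.parameters(), lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        logits = logits_fn(controller(value_cache))
        # Encourage a more certain (lower-entropy) next-token distribution,
        # nudging generation toward the "exploitation" behavior described above.
        loss = entropy_of_logits(logits)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return controller(value_cache)


if __name__ == "__main__":
    # Toy stand-in for a real LMM: a random value cache and a frozen readout head.
    batch, heads, seq, dim, vocab = 1, 8, 16, 64, 1000
    cache = torch.randn(batch, heads, seq, dim)
    readout = torch.nn.Linear(heads * dim, vocab).requires_grad_(False)

    def toy_logits_fn(v):
        # Pool the cache and project to vocabulary logits (placeholder for the LMM).
        pooled = v.mean(dim=2).reshape(batch, heads * dim)
        return readout(pooled)

    adapted = adapt_value_cache(cache, toy_logits_fn)
    print("entropy before:", entropy_of_logits(toy_logits_fn(cache)).item())
    print("entropy after :", entropy_of_logits(toy_logits_fn(adapted)).item())
```

In a real setting, the toy readout would be replaced by a forward pass of the frozen LMM over the modulated key-value cache, and the adaptation would be repeated as decoding progresses; only the small controller is ever updated, which is what keeps the procedure training-free with respect to the LMM itself.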