LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning
September 29, 2025
Authors: Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng
cs.AI
Abstract
Long video understanding remains challenging for recent Large Video-Language
Models (LVLMs) due to the conflict between long-form temporal understanding and
detailed spatial perception. LVLMs with a uniform frame sampling mechanism,
which samples frames at a fixed resolution and sampling rate,
inevitably sacrifice either temporal clues or spatial details, resulting in
suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model
that can adaptively zoom in on a video clip. The model is first provided with
densely sampled frames at a low resolution. If spatial details are needed, it
can zoom in on a clip of interest at a higher frame resolution, guided by its
reasoning, until the key visual information is obtained. The
whole process is implemented as a multi-step reasoning process. To train the
reasoning ability, we first finetune the model on our collected 38k
high-quality CoT data and enhance it with decoupled reinforcement finetuning.
As outcome rewards cannot provide fine-grained process supervision, we
decouple the multi-step reasoning into multiple single-step reasoning steps and optimize
the internal zoom-in ability explicitly. Experiments on long video
understanding benchmarks show that the slow-fast adaptive frame sampling
mechanism strikes a good trade-off between sampling density and frame
resolution, and LOVE-R1 outperforms our baseline, Qwen2.5-VL, by an average of
3.1 percentage points across four common long video understanding benchmarks.
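The slow-fast adaptive sampling loop described above can be sketched as follows. This is a minimal illustration only: all names (`sample_frames`, `model_step`, `ZoomRequest`), the frame rates, and the resolutions are hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ZoomRequest:
    """Hypothetical action emitted by the model to request a zoomed-in clip."""
    start: float  # clip start time (seconds)
    end: float    # clip end time (seconds)


def sample_frames(video, start, end, fps, size):
    """Placeholder sampler: returns (tag, index, resolution) tuples standing in
    for frames of `video` in [start, end], sampled at `fps` and resolution `size`."""
    n = max(1, int((end - start) * fps))
    return [("frame", i, size) for i in range(n)]


def answer_with_adaptive_zoom(model_step, video, question, duration, max_steps=4):
    # Fast view: densely sampled, low-resolution frames of the whole video.
    frames = sample_frames(video, 0.0, duration, fps=1.0, size=112)
    action = None
    for _ in range(max_steps):
        # Each reasoning step either returns a final answer or asks to zoom in.
        action = model_step(question, frames)
        if isinstance(action, ZoomRequest):
            # Slow view: a shorter clip at a higher resolution for spatial detail.
            frames = sample_frames(video, action.start, action.end,
                                   fps=2.0, size=448)
        else:
            return action  # final answer
    return action  # give up after max_steps; return the last action
```

The point of the loop is the trade-off from the abstract: the model never pays for dense high-resolution frames over the full video, only over the clips its reasoning selects.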