

LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

September 29, 2025
Authors: Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng
cs.AI

Abstract

Long video understanding remains challenging for recent Large Video-Language Models (LVLMs) due to the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames at an equal frame size and a fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a small resolution. If spatial details are needed, the model can zoom in on a clip of interest at a larger frame resolution based on its reasoning, until the key visual information is obtained. The whole process is implemented as multi-step reasoning. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning problems and optimize the internal zoom-in ability explicitly. Experiments on long video understanding benchmarks show that our model with the slow-fast adaptive frame sampling mechanism achieves a good trade-off between sampling density and frame resolution, and LOVE-R1 outperforms our baseline Qwen2.5-VL by an average of 3.1 percentage points across 4 common long video understanding benchmarks.
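The abstract describes an iterative loop: the model first sees densely sampled low-resolution frames, then may repeatedly request a high-resolution zoom-in on a clip before answering. Below is a minimal, hypothetical sketch of that control flow; the function names (`sample_frames`, `model_step`, `answer_with_zoom`) and all sampling parameters are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of LOVE-R1's multi-step zoom-in loop.
# All names and parameter values here are illustrative assumptions.

def sample_frames(video, start, end, fps, size):
    """Stand-in frame sampler: returns (timestamp, frame_size) pairs."""
    n = max(1, int((end - start) * fps))
    step = (end - start) / n
    return [(start + i * step, size) for i in range(n)]

def model_step(frames, question, history):
    """Stand-in for one reasoning step. A real LVLM would decide between
    answering and zooming in; this stub zooms once, then answers."""
    if not history:
        return ("zoom", (40.0, 50.0))  # clip of interest, in seconds
    return ("answer", "stub answer")

def answer_with_zoom(video, duration, question, max_steps=4):
    # Step 1: densely sampled but low-resolution frames (the "fast" view).
    frames = sample_frames(video, 0.0, duration, fps=1.0, size=112)
    history = []
    for _ in range(max_steps):
        action, payload = model_step(frames, question, history)
        if action == "answer":
            return payload
        # Zoom in: re-sample the requested clip at a higher resolution
        # (the "slow" view) and append it to the visual context.
        start, end = payload
        frames += sample_frames(video, start, end, fps=4.0, size=448)
        history.append((action, payload))
    return None
```

Keeping the dense low-resolution stream while appending only short high-resolution clips is what lets the token budget cover both long-range temporal context and fine spatial detail.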
September 30, 2025