LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

Abstract

Long video understanding remains challenging for recent Large Video-Language Models (LVLMs) because long-form temporal understanding conflicts with detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames at an equal frame size and a fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a low resolution. If spatial details are needed, the model can zoom in on a clip of interest at a high frame resolution based on its reasoning, repeating until it obtains the key visual information. The whole process is implemented as a multi-step reasoning process. To train this reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and then enhance it with decoupled reinforcement finetuning. Because outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning processes and explicitly optimize the internal zoom-in ability. Experiments on long video understanding benchmarks show that our model, with its slow-fast adaptive frame sampling mechanism, achieves a strong trade-off between sampling density and frame resolution, and that LOVE-R1 outperforms our baseline Qwen2.5-VL by an average of 3.1 percentage points across 4 common long video understanding benchmarks.
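The zoom-in loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `FrameRequest`, `Step`, `frame_timestamps`, and `answer_with_zoom`, the zoom budget, the frame rates, and the two resolutions are all hypothetical, and the `policy` callable stands in for the model's reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class FrameRequest:
    """Frames to sample from [start, end) seconds at a given rate and size."""
    start: float
    end: float
    fps: float
    height: int  # hypothetical resolution knob, in pixels


@dataclass(frozen=True)
class Step:
    """One reasoning step: either a final answer or a clip to zoom in on."""
    answer: Optional[str] = None
    zoom: Optional[Tuple[float, float]] = None


# Stand-in for the model: maps the visual context so far to the next step.
Policy = Callable[[List[FrameRequest]], Step]


def frame_timestamps(req: FrameRequest) -> List[float]:
    """Uniform timestamps (seconds) inside one request."""
    count = int((req.end - req.start) * req.fps)
    return [req.start + i / req.fps for i in range(count)]


def answer_with_zoom(
    duration: float,
    policy: Policy,
    max_zooms: int = 3,       # assumed budget, not from the paper
    overview_fps: float = 1.0,  # assumed
    zoom_fps: float = 1.0,      # assumed
    low_height: int = 112,      # assumed low resolution
    high_height: int = 448,     # assumed high resolution
) -> Tuple[Optional[str], List[FrameRequest]]:
    # Start with a dense, low-resolution view of the whole video.
    context = [FrameRequest(0.0, duration, overview_fps, low_height)]
    for _ in range(max_zooms):
        step = policy(context)
        if step.answer is not None:
            return step.answer, context
        if step.zoom is None:
            break
        # Clamp the requested clip to the video bounds.
        start = max(0.0, step.zoom[0])
        end = min(duration, step.zoom[1])
        if end <= start:
            break
        # Add a high-resolution view of the clip of interest.
        context.append(FrameRequest(start, end, zoom_fps, high_height))
    # Budget exhausted or invalid request: take whatever answer the policy gives.
    return policy(context).answer, context
```

In this sketch, a policy that requests one clip and then answers would leave two entries in the returned context: the low-resolution overview and one high-resolution clip.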
