LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

Abstract

Long video understanding remains challenging for recent Large Video-Language Models (LVLMs) because of the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames at an equal frame size and a fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a low resolution. If spatial details are needed, the model can, based on its reasoning, zoom in on a clip of interest at a high frame resolution, repeating until the key visual information is obtained. The whole process is implemented as a multi-step reasoning process. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and then enhance it with decoupled reinforcement finetuning. Because outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning tasks and explicitly optimize the internal zoom-in ability. Experiments on long video understanding benchmarks show that our model, with its slow-fast adaptive frame sampling mechanism, achieves a strong trade-off between sampling density and frame resolution, and LOVE-R1 outperforms our baseline Qwen2.5-VL by an average of 3.1 percentage points across 4 common long video understanding benchmarks.
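The zoom-in procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Frame`, `Step`, `sample`, `zoom_in_reasoning`, and `toy_policy`, and all numeric settings (64 low-resolution frames, 8 high-resolution frames, a budget of 3 steps, frame sides of 112 and 448 pixels), are hypothetical choices made for illustration, and the `policy` callable merely stands in for the model's reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Frame:
    t: float   # timestamp in seconds
    side: int  # hypothetical square frame side, in pixels


@dataclass
class Step:
    # One reasoning step: either a final answer or a clip to zoom in on.
    answer: Optional[str] = None
    zoom: Optional[Tuple[float, float]] = None  # (start, end) in seconds


def sample(start: float, end: float, n: int, side: int) -> List[Frame]:
    """Uniformly sample n frames from [start, end] at the given frame side."""
    if n == 1:
        return [Frame((start + end) / 2, side)]
    gap = (end - start) / (n - 1)
    return [Frame(start + i * gap, side) for i in range(n)]


def zoom_in_reasoning(
    duration: float,
    policy: Callable[[List[Frame]], Step],
    max_steps: int = 3,    # assumed step budget, not from the paper
    n_fast: int = 64,      # assumed dense, low-resolution sampling
    fast_side: int = 112,
    n_slow: int = 8,       # assumed sparse, high-resolution sampling
    slow_side: int = 448,
) -> Tuple[Optional[str], List[Frame]]:
    # Start with the whole video, densely sampled at low resolution.
    context = sample(0.0, duration, n_fast, fast_side)
    for _ in range(max_steps):
        step = policy(context)
        if step.zoom is None:
            # The policy needs no further spatial detail and stops here.
            return step.answer, context
        # Otherwise add high-resolution frames from the requested clip.
        start, end = step.zoom
        context = context + sample(start, end, n_slow, slow_side)
    # Budget exhausted: ask once more; the answer may still be None.
    return policy(context).answer, context


def toy_policy(context: List[Frame]) -> Step:
    # Stand-in for the model: zoom once into 30-40 s, then answer.
    if not any(f.side == 448 for f in context):
        return Step(zoom=(30.0, 40.0))
    return Step(answer="B")


answer, frames = zoom_in_reasoning(120.0, toy_policy)
```

In this sketch the context grows by a small number of high-resolution frames per zoom-in, so the visual input stays dense in time across the whole video while remaining sharp only where the policy asks for detail. This is one way to picture the trade-off between sampling density and frame resolution that the abstract describes.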
