VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Abstract

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and yield unstable improvements in both the quality of long chains of thought (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier uses small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive an efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in length and contextual consistency. The training loop thus combines GRPO's expansive search with DPO's targeted optimization. Experimental results demonstrate that: 1) optimization is significantly faster and more effective than with standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
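The verifier-to-DPO step of the loop can be illustrated with a minimal sketch. It is not the authors' implementation. The `Rollout` structure, the `build_preference_pair` selection rule, and the toy `ends_with_answer` judge (a stand-in for the small-LLM judge) are all hypothetical names and assumptions introduced here. Only the loss follows the standard DPO objective, -log σ(β[(log π(y_c) - log π_ref(y_c)) - (log π(y_r) - log π_ref(y_r))]).

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical structure)."""
    reasoning: str   # chain-of-thought text
    answer: str      # final answer extracted from the response
    correct: bool    # outcome reward: does the answer match the label?


def build_preference_pair(
    rollouts: List[Rollout],
    judge: Callable[[Rollout], float],
) -> Optional[Tuple[Rollout, Rollout]]:
    """Return (chosen, rejected) for one prompt, or None if no pair exists.

    Assumed rule: the chosen sample is the correct rollout the judge rates
    highest; the rejected sample is the remaining rollout with the lowest
    combined outcome-plus-judge score.
    """
    correct = [r for r in rollouts if r.correct]
    if not correct:
        return None
    chosen = max(correct, key=judge)
    others = [r for r in rollouts if r is not chosen]
    if not others:
        return None
    rejected = min(others, key=lambda r: (1.0 if r.correct else 0.0) + judge(r))
    return chosen, rejected


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair, computed as -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ends_with_answer(r: Rollout) -> float:
    """Toy judge (assumption): 1.0 if the reasoning ends with the final answer."""
    return 1.0 if r.reasoning.strip().endswith(r.answer) else 0.0


rollouts = [
    Rollout("The ball moves left, so the answer is B", "B", True),
    Rollout("The ball moves right, so the answer is A", "B", True),
    Rollout("Nothing moves, so the answer is D", "C", False),
]
pair = build_preference_pair(rollouts, ends_with_answer)
```

In this toy case the second rollout is correct by outcome but its reasoning contradicts its answer, so the judge ranks it below the first. That is the kind of inconsistency an outcome-only reward cannot distinguish.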
