

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

May 25, 2025
作者: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
cs.AI

Abstract

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data-preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chains-of-thought (CoTs) and in downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. The verifier leverages small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive an efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning-chain quality, especially in length and contextual consistency. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization than standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with a single iteration outperforms powerful LMMs (e.g., Kimi-VL) and long-reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
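
For readers who want a concrete picture of the GRPO-Verifier-DPO loop described above, the following is a minimal, hypothetical Python sketch of one training iteration. The function names (`sample_rollouts`, `judge_rollout`, `grpo_update`, `dpo_update`) and the `Rollout` record are illustrative assumptions, not the authors' released code; the sketch only shows the control flow (GRPO exploration, verifier-based curation of preference pairs, then a DPO step), with all heavy lifting left as stubs.

```python
# Hypothetical sketch of the GRPO-Verifier-DPO training loop (not the authors' code).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    prompt: str   # video question (video frames omitted for brevity)
    cot: str      # generated chain-of-thought
    answer: str   # final answer extracted from the rollout
    reward: float # outcome-based reward used by GRPO


def sample_rollouts(policy, prompt: str, k: int = 8) -> List[Rollout]:
    """Stub: sample k candidate CoTs for a prompt with the current policy."""
    raise NotImplementedError


def judge_rollout(verifier_llm, rollout: Rollout) -> bool:
    """Stub: use a small LLM as a judge to check that the reasoning chain is
    logically sound and consistent with the context and final answer."""
    raise NotImplementedError


def grpo_update(policy, group: List[Rollout]) -> None:
    """Stub: one outcome-based, group-relative policy-optimization step."""
    raise NotImplementedError


def dpo_update(policy, pairs: List[Tuple[Rollout, Rollout]]) -> None:
    """Stub: one Direct Preference Optimization step on (chosen, rejected) pairs."""
    raise NotImplementedError


def build_preference_pairs(rollouts: List[Rollout], verifier_llm) -> List[Tuple[Rollout, Rollout]]:
    """Split rollouts into verified vs. flawed CoTs and pair them as
    (chosen, rejected) contrastive data for the DPO stage."""
    verified, flawed = [], []
    for r in rollouts:
        (verified if judge_rollout(verifier_llm, r) else flawed).append(r)
    return [(c, f) for c in verified for f in flawed]


def train_veripo(policy, verifier_llm, prompts: List[str], iterations: int = 1):
    for _ in range(iterations):
        # 1) GRPO phase: broad exploration with outcome-based group-relative rewards.
        for prompt in prompts:
            grpo_update(policy, sample_rollouts(policy, prompt))

        # 2) Rollout-aware verifier: curate contrastive CoT pairs from fresh rollouts.
        pairs: List[Tuple[Rollout, Rollout]] = []
        for prompt in prompts:
            pairs.extend(build_preference_pairs(sample_rollouts(policy, prompt), verifier_llm))

        # 3) DPO phase: targeted optimization on the curated preference pairs
        #    (reported in the abstract as roughly 7x faster than GRPO).
        dpo_update(policy, pairs)
    return policy
```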

