

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

May 25, 2025
Authors: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
cs.AI

Abstract

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chains of thought (CoTs) and in downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier uses small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning-chain quality, especially in terms of length and contextual consistency. The training loop thus combines GRPO's expansive search with DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization than standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model, after a single iteration, outperforms powerful LMMs (e.g., Kimi-VL) and long-reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
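The GRPO-Verifier-DPO loop described in the abstract can be summarized schematically. The sketch below is a minimal, hypothetical illustration of that loop, not the authors' released implementation: the function names (grpo_stage, rollout_aware_verifier, dpo_stage, veripo_loop), the outcome reward, and the judge's scoring are all stand-in assumptions, with the policy and optimization steps stubbed out.

```python
# Minimal sketch of a GRPO-Verifier-DPO training loop.
# All names, data shapes, and scoring heuristics are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple
import random

@dataclass
class Rollout:
    question: str
    chain_of_thought: str
    answer: str
    reward: float  # outcome-based reward used in the GRPO stage (stubbed)

def grpo_stage(policy, batch: List[str], group_size: int = 8) -> List[List[Rollout]]:
    """Expansive search: sample a group of rollouts per question and attach an
    outcome-based reward (random placeholder here). In the real method the
    policy would also be updated with the GRPO objective at this point."""
    groups = []
    for q in batch:
        group = [Rollout(q, policy(q), answer="...", reward=random.random())
                 for _ in range(group_size)]
        groups.append(group)
    return groups

def rollout_aware_verifier(group: List[Rollout], judge) -> Tuple[Rollout, Rollout]:
    """Use a small LLM judge to score the reasoning logic of each rollout and
    form a (chosen, rejected) contrastive pair: the most consistent CoT vs.
    the weakest one. `judge` is assumed to return a scalar logic score."""
    scored = sorted(group, key=lambda r: (r.reward, judge(r.chain_of_thought)))
    return scored[-1], scored[0]

def dpo_stage(policy, preference_pairs: List[Tuple[Rollout, Rollout]]):
    """Targeted optimization: fine-tune the policy on verifier-curated
    preference pairs (the DPO update itself is a no-op placeholder here)."""
    for chosen, rejected in preference_pairs:
        pass  # apply the DPO loss on (chosen, rejected)
    return policy

def veripo_loop(policy, judge, data: List[str], iterations: int = 2):
    """One iteration = GRPO rollouts -> verifier filtering -> fast DPO update."""
    for _ in range(iterations):
        groups = grpo_stage(policy, data)
        pairs = [rollout_aware_verifier(g, judge) for g in groups]
        policy = dpo_stage(policy, pairs)
    return policy

if __name__ == "__main__":
    toy_policy = lambda q: f"step-by-step reasoning about: {q}"
    toy_judge = lambda cot: len(cot)  # placeholder "reasoning logic" score
    veripo_loop(toy_policy, toy_judge, ["What happens after the goal is scored?"])
```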
