VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
May 25, 2025
作者: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
cs.AI
Abstract
Applying Reinforcement Learning (RL) to Video Large Language Models
(Video-LLMs) shows significant promise for complex video reasoning. However,
popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group
Relative Policy Optimization (GRPO), are limited by data preparation
bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the
quality of long chain-of-thoughts (CoTs) and downstream performance. To address
these limitations, we propose VerIPO, a Verifier-guided Iterative Policy
Optimization method designed to gradually improve video LLMs' capacity for
generating deep, long-term reasoning chains. The core component is the
Rollout-Aware Verifier, positioned between the GRPO and Direct Preference
Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop.
This verifier leverages small LLMs as judges to assess the reasoning logic of
rollouts, enabling the construction of high-quality contrastive data, including
reflective and contextually consistent CoTs. These curated preference samples
drive the efficient DPO stage (7x faster than GRPO), leading to marked
improvements in reasoning chain quality, especially in terms of length and
contextual consistency. This training loop benefits from GRPO's expansive
search and DPO's targeted optimization. Experimental results demonstrate: 1)
Significantly faster and more effective optimization compared to standard GRPO
variants, yielding superior performance; 2) Our trained models exceed the
direct inference of large-scale instruction-tuned Video-LLMs, producing long
and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our
model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long
reasoning models (e.g., Video-R1), highlighting its effectiveness and
stability.
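
The following is a minimal, hypothetical sketch of the GRPO-Verifier-DPO training loop described in the abstract, assuming the small-LLM judge is exposed as a scoring callable over a chain-of-thought. Names such as `sample_fn`, `grpo_update`, `dpo_update`, and `judge_reasoning` are illustrative stand-ins, not the authors' released implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    prompt: str            # video question (video tokens omitted in this sketch)
    cot: str               # generated chain-of-thought
    answer: str            # final answer parsed from the rollout
    outcome_reward: float  # GRPO's outcome-based reward (e.g., answer correctness)


def rollout_aware_verifier(
    rollouts: List[Rollout],
    judge: Callable[[str], float],
) -> List[Tuple[Rollout, Rollout]]:
    """Score the reasoning logic of each rollout with a small LLM judge and
    assemble contrastive (chosen, rejected) pairs for DPO.

    `judge` is assumed to map a chain-of-thought to a logic-quality score in [0, 1].
    """
    scored = [(r, judge(r.cot)) for r in rollouts]
    # Prefer rollouts that are both correct and judged logically consistent.
    chosen = [r for r, s in scored if r.outcome_reward > 0 and s >= 0.5]
    rejected = [r for r, s in scored if r.outcome_reward <= 0 or s < 0.5]
    # Pair each high-quality CoT with a low-quality one from the same prompt.
    pairs: List[Tuple[Rollout, Rollout]] = []
    for good in chosen:
        for bad in rejected:
            if bad.prompt == good.prompt:
                pairs.append((good, bad))
                break
    return pairs


def veripo_iteration(policy, prompts, sample_fn, grpo_update, dpo_update, judge):
    """One GRPO -> Verifier -> DPO cycle (schematic)."""
    # 1) GRPO phase: broad search over sampled rollouts with outcome rewards.
    rollouts = [r for p in prompts for r in sample_fn(policy, p)]
    policy = grpo_update(policy, rollouts)
    # 2) Verifier phase: curate preference pairs from the same rollouts.
    preference_pairs = rollout_aware_verifier(rollouts, judge)
    # 3) DPO phase: targeted optimization on curated pairs (reported as ~7x
    #    cheaper than GRPO), then the loop repeats with the updated policy.
    policy = dpo_update(policy, preference_pairs)
    return policy
```

The sketch only captures the control flow of the loop: GRPO supplies broad exploration and outcome signals, the verifier filters its rollouts into reflective, contextually consistent versus flawed CoTs, and DPO consumes those pairs for a cheaper, targeted refinement step.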