
VIDEOP2R: Video Understanding from Perception to Reasoning

November 14, 2025
Authors: Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan
cs.AI

Abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) followed by reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for the perception and reasoning processes. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
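To make the separate-reward idea concrete, below is a minimal sketch of how a PA-GRPO-style advantage computation could look, assuming the standard GRPO recipe of group-relative reward normalization. This is an illustration, not the paper's implementation: the function names, the group size, and the placeholder reward values are all assumptions, and the paper's actual reward functions for the perception and reasoning spans are not specified in the abstract.

import numpy as np

def group_relative_advantages(rewards):
    # Standard GRPO normalization: each rollout's advantage is its
    # reward standardized against the group's mean and std.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def pa_grpo_advantages(perception_rewards, reasoning_rewards):
    # Hypothetical process-aware step: perception and reasoning rewards
    # are normalized in separate groups, so each process receives its
    # own relative learning signal. Tokens in a rollout's perception
    # span would be weighted by adv_p; reasoning/answer tokens by adv_r.
    adv_p = group_relative_advantages(perception_rewards)
    adv_r = group_relative_advantages(reasoning_rewards)
    return adv_p, adv_r

# Toy usage: a group of 4 rollouts for one video question. Rewards are
# placeholders (e.g. a judge score for the perception description and
# final-answer correctness for reasoning).
adv_p, adv_r = pa_grpo_advantages(
    perception_rewards=[0.9, 0.4, 0.7, 0.1],
    reasoning_rewards=[1.0, 0.0, 1.0, 0.0],
)
print(adv_p, adv_r)

One consequence of keeping the two reward streams in separate groups, as sketched here, is that a rollout with accurate perception but a wrong final answer can still receive a positive perception-side signal, which is consistent with the abstract's framing of perception and reasoning as distinct processes.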