VIDEOP2R: Video Understanding from Perception to Reasoning

November 14, 2025
Authors: Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan
cs.AI

Abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
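
The abstract does not spell out PA-GRPO's formulation, so the following is only a minimal sketch of the core idea: assuming PA-GRPO keeps GRPO's group-normalized advantage baseline but applies it to each process's reward stream separately. The function name, example reward values, and the 1e-8 stabilizer are illustrative, not taken from the paper.

```python
import numpy as np

def pa_grpo_advantages(perception_rewards, reasoning_rewards, eps=1e-8):
    """Hypothetical sketch: compute group-relative advantages separately
    for the perception and reasoning segments of each sampled rollout.

    perception_rewards, reasoning_rewards: length-G sequences holding the
    per-rollout rewards for a group of G responses to one prompt.
    """
    def group_normalize(r):
        r = np.asarray(r, dtype=np.float64)
        # Standard GRPO baseline: center and scale rewards within the group.
        return (r - r.mean()) / (r.std() + eps)

    # Each process gets its own group-relative baseline, so a rollout can be
    # credited for accurate perception even when its final answer is wrong,
    # and vice versa.
    adv_perception = group_normalize(perception_rewards)
    adv_reasoning = group_normalize(reasoning_rewards)
    return adv_perception, adv_reasoning

# Example: a group of 4 rollouts for one video-question pair
# (reward definitions here are placeholders).
perc = [1.0, 1.0, 0.0, 1.0]   # e.g., judged quality of the perception output
reas = [1.0, 0.0, 0.0, 1.0]   # e.g., correctness of the final answer
adv_p, adv_r = pa_grpo_advantages(perc, reas)
print(adv_p, adv_r)
```

In a full RFT loop, these two advantage vectors would presumably weight the policy-gradient loss over the perception-segment and reasoning-segment tokens of each rollout, respectively; the paper's actual reward models and token assignment may differ.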