Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

November 21, 2025
Authors: Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu
cs.AI

Abstract

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
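
The abstract describes the rumination loop only in prose. As a purely illustrative aid, the sketch below shows what an iterative select-zoom-re-encode-update loop of this kind could look like in Python. Every helper here (select_frame, propose_zoom, encode_region, update_state), the data layout, and the stopping rule are hypothetical stand-ins for the LMM's learned policy, not the paper's implementation.

```python
"""Illustrative sketch of a visual-rumination loop (not the paper's code).

The model repeatedly (1) picks a frame, (2) zooms into an informative
region, (3) re-encodes the cropped pixels, and (4) folds the new
evidence into its reasoning state, until it decides to answer.
"""
from dataclasses import dataclass, field


@dataclass
class ReasoningState:
    question: str
    evidence: list = field(default_factory=list)  # encoded crops seen so far
    answer: str | None = None


def select_frame(video, state):
    """Hypothetical: the policy picks the frame most relevant to the question."""
    return 0  # stub: always the first frame


def propose_zoom(frame, state):
    """Hypothetical: the policy proposes a box (x0, y0, x1, y1) to magnify."""
    h, w = len(frame), len(frame[0])
    return (0, 0, w // 2, h // 2)  # stub: top-left quadrant


def encode_region(frame, box):
    """Hypothetical: crop and re-encode the region at higher resolution."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]  # stub crop


def update_state(state, encoded):
    """Fold the newly retrieved visual evidence into the reasoning state."""
    state.evidence.append(encoded)
    if len(state.evidence) >= 3:  # stub stopping rule
        state.answer = "<answer derived from accumulated evidence>"
    return state


def ruminate(video, question, max_steps=8):
    state = ReasoningState(question)
    for _ in range(max_steps):
        frame = video[select_frame(video, state)]
        crop = encode_region(frame, propose_zoom(frame, state))
        state = update_state(state, crop)
        if state.answer is not None:
            break
    return state.answer


# Toy run: a "video" of two 4x4 single-channel frames.
video = [[[0] * 4 for _ in range(4)] for _ in range(2)]
print(ruminate(video, "What does the sign say?"))
```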
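The abstract names GRPO for the RL stage but does not restate its objective. For reference, the standard GRPO formulation (as introduced in DeepSeekMath) normalizes each sampled trajectory's reward within its group to form the advantage, then optimizes a clipped surrogate with a KL penalty; the paper's specific reward design for rumination trajectories is not given in the abstract.

```latex
% Standard GRPO form (DeepSeekMath); the reward r_i for rumination
% trajectories is the paper's own design and is not specified here.
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},
\qquad
\mathcal{J}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\big(\rho_i \hat{A}_i,\;
  \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\big)\right]
- \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
```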