Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
November 21, 2025
Authors: Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu
cs.AI
Abstract
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixed visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
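To make the rumination loop concrete, the sketch below walks through the cycle the abstract describes: select a frame, zoom into a region, re-encode the retrieved pixels, and update the reasoning state. It is a minimal illustration under our own assumptions; every name here (ReasoningState, select_frame, propose_zoom_region, encode_region, update_state, answer_ready) and the fixed stopping criterion are hypothetical placeholders, not the Video-R4 API. In the actual model, the policy itself decides which frame to inspect, where to zoom, and when to stop.

```python
# Minimal sketch of a visual-rumination loop, as described in the abstract.
# All helper functions are hypothetical placeholders, not the authors' code.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReasoningState:
    """Running reasoning state: the question plus evidence gathered so far."""
    question: str
    evidence: List[str] = field(default_factory=list)


def select_frame(state: ReasoningState, num_frames: int) -> int:
    """Pick the next frame to inspect (placeholder: round-robin over frames)."""
    return len(state.evidence) % num_frames


def propose_zoom_region(state: ReasoningState, frame_idx: int) -> Tuple[int, int, int, int]:
    """Propose a box (x0, y0, x1, y1) to zoom into (placeholder: the full frame)."""
    return (0, 0, 1920, 1080)


def encode_region(frame_idx: int, box: Tuple[int, int, int, int]) -> str:
    """Re-encode the cropped pixels into new evidence (placeholder: a string tag)."""
    return f"frame={frame_idx}, box={box}"


def update_state(state: ReasoningState, encoded: str) -> None:
    """Fold the newly read evidence back into the reasoning state."""
    state.evidence.append(encoded)


def answer_ready(state: ReasoningState, max_steps: int = 4) -> bool:
    """Stop once enough evidence has been gathered (placeholder criterion)."""
    return len(state.evidence) >= max_steps


def ruminate(question: str, num_frames: int) -> ReasoningState:
    """Iteratively select a frame, zoom, re-encode the pixels, and update state."""
    state = ReasoningState(question=question)
    while not answer_ready(state):
        frame_idx = select_frame(state, num_frames)
        box = propose_zoom_region(state, frame_idx)
        encoded = encode_region(frame_idx, box)
        update_state(state, encoded)
    return state


if __name__ == "__main__":
    final_state = ruminate("What is written on the whiteboard?", num_frames=32)
    print(final_state.evidence)
```

The abstract also names GRPO-based RL. As background (this is the standard GRPO formulation, not a detail taken from the paper), GRPO samples a group of G rollouts per prompt and replaces a learned value baseline with a group-relative advantage,

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})},
\]

so rumination trajectories that lead to better answers than their peers on the same question are reinforced.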