REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
November 17, 2025
Authors: Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan
cs.AI
Abstract
Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when applied directly to long-form video understanding, they exhibit clear limitations. The fundamental reasons are twofold: (1) long-form video understanding involves richer and more dynamic visual input, so rethinking textual information alone is insufficient; a dedicated rethinking process targeting visual information is required; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across the textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR learns to accurately review the video segments most relevant to the question during reinforcement learning, we design the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, REVISOR significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.
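To make the reward idea concrete, below is a minimal sketch (not the authors' implementation) of how a decoupled two-part reward could feed into GRPO-style group-normalized advantages. The function names, the IoU-based attribution score, the gating rule that only credits a correct answer when the reviewed segment overlaps the ground-truth evidence, and the 0.5/0.5 weighting are all illustrative assumptions; only the group-relative advantage normalization reflects standard GRPO practice.

```python
# Illustrative sketch only: a decoupled answer/attribution reward plus
# GRPO-style group-normalized advantages. Names, gating rule, and weights
# are assumptions, not details taken from the REVISOR paper.
import numpy as np

def dual_attribution_reward(answer_correct: bool, segment_iou: float,
                            iou_threshold: float = 0.5) -> float:
    """Combine an answer-correctness term with an evidence-attribution term.

    Hypothetical gating rule: the answer reward only counts when the
    reviewed segment sufficiently overlaps the ground-truth evidence,
    mimicking the intended causal alignment between reasoning and evidence.
    """
    evidence_reward = float(segment_iou)  # attribution term
    answer_reward = 1.0 if (answer_correct and segment_iou >= iou_threshold) else 0.0
    return 0.5 * answer_reward + 0.5 * evidence_reward  # assumed equal weighting

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and std of its sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

if __name__ == "__main__":
    # One group of sampled rollouts for a single long-video question.
    rollouts = [
        {"answer_correct": True,  "segment_iou": 0.8},  # right answer, right segment
        {"answer_correct": True,  "segment_iou": 0.1},  # right answer, wrong segment
        {"answer_correct": False, "segment_iou": 0.7},  # wrong answer, right segment
        {"answer_correct": False, "segment_iou": 0.0},
    ]
    rewards = [dual_attribution_reward(**ro) for ro in rollouts]
    print("rewards:", rewards)
    print("advantages:", grpo_advantages(rewards))
```

Under this assumed gating, a rollout that answers correctly but attends to an irrelevant segment earns little reward, so the group-relative advantage pushes the policy toward rollouts whose cited video evidence actually supports the answer.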