REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Abstract

Self-reflection mechanisms that rely on purely text-based rethinking perform well on most multimodal tasks. When applied directly to long-form video understanding, however, they show clear limitations, for two fundamental reasons: (1) long-form video understanding involves richer and more dynamic visual input, so rethinking only the textual information is insufficient and a further rethinking process specifically targeting visual information is needed; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, which prevents them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across the textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR learns, during reinforcement learning, to accurately review the video segments most relevant to the question, we design the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.
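The abstract does not spell out how DADR computes or combines its reward terms, so the following is only a minimal sketch of the group-relative advantage step that GRPO-style training generally relies on. The function `toy_reward`, its weights, and the `segment_score` input are hypothetical placeholders for illustration and are not the paper's DADR definition. Whether the standard deviation is the population or sample variant also differs between implementations; the population form is assumed here.

```python
from statistics import mean, pstdev


def toy_reward(answer_correct, segment_score, w_answer=1.0, w_segment=0.5):
    """Hypothetical scalar reward for one rollout (NOT the paper's DADR).

    answer_correct: whether the final answer is right.
    segment_score:  assumed score in [0, 1] for how relevant the
                    reviewed video segment is to the question.
    """
    return w_answer * float(answer_correct) + w_segment * segment_score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    rollout is judged relative to its siblings, not on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts sampled for a single question.
rollouts = [(True, 0.8), (False, 0.1), (True, 0.2), (False, 0.6)]
rewards = [toy_reward(ok, seg) for ok, seg in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy setup, a rollout that answers correctly after reviewing a relevant segment receives the largest advantage, while a group in which every rollout earns the same reward yields zero advantage for all of them and therefore contributes no learning signal.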