Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
December 30, 2025
Authors: Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation pipelines, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies pair-wise ℓ₁ advantage normalization, thereby enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
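
The abstract names pair-wise ℓ₁ advantage normalization but does not spell out the formula. Below is a minimal PyTorch sketch of one plausible reading, assuming GRPO-style group advantages where each original-edited video pair forms a single normalization group and the scale is the mean absolute deviation (ℓ₁) rather than the standard deviation. The function name and tensor layout are hypothetical; the paper's exact formulation may differ.

```python
# Hypothetical sketch, not the paper's exact method: rollout rewards from an
# original/edited video pair are centered by the pair-level mean and scaled by
# the pair's mean absolute deviation (an l1-style scale).
import torch

def pairwise_l1_normalized_advantages(rewards: torch.Tensor,
                                      eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_pairs, 2, num_rollouts) -- rewards for rollouts on the
    original and edited video of each pair. Returns same-shape advantages."""
    # Flatten each pair's rollouts so both videos share one baseline,
    # exploiting the contrastive structure of the paired data.
    flat = rewards.reshape(rewards.size(0), -1)           # (num_pairs, 2*num_rollouts)
    centered = flat - flat.mean(dim=1, keepdim=True)      # subtract pair baseline
    l1_scale = centered.abs().mean(dim=1, keepdim=True)   # l1 (mean-absolute) scale
    return (centered / (l1_scale + eps)).reshape_as(rewards)

# Example: 4 pairs, 2 videos per pair, 8 rollouts each.
rewards = torch.rand(4, 2, 8)
adv = pairwise_l1_normalized_advantages(rewards)  # same shape, zero mean per pair
```

Compared with the usual per-group standard-deviation normalization, an ℓ₁ scale is less sensitive to a single outlier reward within a pair, which is one way such a scheme could yield the more stable policy optimization the abstract claims.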