

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

December 30, 2025
作者: Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies pair-wise ℓ₁ advantage normalization, enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
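The abstract does not spell out the exact form of the pair-wise ℓ₁ advantage normalization, so the following is only a minimal sketch of one plausible reading: mean-center the rewards within each original/edited video pair (so the pair acts as its own baseline) and rescale by the pair's ℓ₁ norm. The function name, the two-sample-per-pair setup, and the zero-norm guard are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

def pairwise_l1_normalized_advantages(rewards_orig, rewards_edit):
    """Hypothetical sketch of pair-wise L1 advantage normalization.

    For each (original, edited) video pair, center the two rewards on the
    pair mean, then divide by the pair's L1 norm so every pair contributes
    a comparably scaled contrastive signal to the policy update.
    """
    advantages = []
    for r_orig, r_edit in zip(rewards_orig, rewards_edit):
        pair = np.array([r_orig, r_edit], dtype=float)
        centered = pair - pair.mean()        # pair mean as a local baseline
        l1 = np.abs(centered).sum()          # L1 norm of the centered pair
        # Guard against identical rewards (zero norm) in this sketch.
        advantages.append(centered / l1 if l1 > 0 else centered)
    return np.array(advantages)
```

Under this reading, a pair where the model is rewarded 1.0 on the original video and 0.0 on its counterfactual edit yields advantages (+0.5, −0.5), regardless of the absolute reward scale.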