VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

June 24, 2024
作者: Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet, these models are often plagued by "hallucinations", where irrelevant or nonsensical content is generated, deviating from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis, including object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, where pairs of basic and hallucinated questions are crafted strategically. By evaluating eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models exhibit significant issues with hallucinations; ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; iii) existing models are more adept at detecting facts than identifying hallucinations. As a byproduct, these analyses further instruct the development of our self-PEP framework, achieving an average of 5.38% improvement in hallucination resistance across all model architectures.
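As a rough illustration of the adversarial binary VideoQA protocol described above, the sketch below scores a model on paired questions, counting a pair as resisting hallucination only when the basic (grounded) question is answered "yes" and the hallucinated question is answered "no". The names `QAPair`, `evaluate_pairs`, and `answer_fn` are hypothetical and not part of the VideoHallucer codebase; the benchmark's actual scoring may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical container for one adversarial pair: a basic question grounded in
# the video (expected answer "yes") and a hallucinated question describing
# absent or incorrect content (expected answer "no").
@dataclass
class QAPair:
    video_path: str
    basic_question: str
    hallucinated_question: str

def evaluate_pairs(pairs: List[QAPair],
                   answer_fn: Callable[[str, str], str]) -> dict:
    """Score a video-language model on adversarial binary VideoQA pairs.

    answer_fn(video_path, question) is assumed to return "yes" or "no".
    A pair counts as correct only if BOTH questions are answered correctly,
    which penalizes models that simply answer "yes" to everything.
    """
    basic_correct = halluc_correct = pair_correct = 0
    for p in pairs:
        b = answer_fn(p.video_path, p.basic_question).strip().lower() == "yes"
        h = answer_fn(p.video_path, p.hallucinated_question).strip().lower() == "no"
        basic_correct += b
        halluc_correct += h
        pair_correct += b and h
    n = len(pairs)
    return {
        "basic_acc": basic_correct / n,          # confirming real content
        "hallucinated_acc": halluc_correct / n,  # rejecting hallucinated content
        "pair_acc": pair_correct / n,            # joint, adversarial score
    }
```

Requiring both answers in a pair to be correct is what makes the evaluation adversarial: a model biased toward agreement can score well on either question type alone but not on the joint metric.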