Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
December 4, 2025
Authors: Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
cs.AI
Abstract
Recent advances in multimodal LLMs (MLLMs) have demonstrated their remarkable capability to generate descriptive captions for input videos. However, the generated descriptions often contain factual inaccuracies, which leads to severe hallucination issues. While prior work has explored alleviating hallucinations for static images, jointly mitigating visual object hallucinations and temporal action hallucinations in dynamic videos remains a challenging, unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that improves object and action faithfulness by suppressing spurious correlations and strengthening the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify potential hallucinations latent in the MLLM and transform the original captions into contrastive negatives. Furthermore, we develop a tracklet-phrase contrastive alignment that matches regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.