GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection
January 28, 2026
Authors: Shuguang Zhang, Junhong Lian, Guoxin Yu, Baoxun Xu, Xiang Ao
cs.AI
Abstract
Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet's superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.
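The discrepancy-and-gating idea described above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, the scalar sigmoid gate, and the use of cosine distance as the discrepancy measure are all illustrative assumptions; GDCNet's real discrepancy features and gated fusion module operate on learned neural representations.

```python
import numpy as np

def cosine_discrepancy(a, b):
    """1 - cosine similarity: higher means a larger semantic gap
    between the generated caption (anchor) and the original text."""
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - sim

def gated_fusion(text_feat, visual_feat, discrepancy, w=1.0, b=0.0):
    """Blend the two modality features with a scalar gate driven by
    the discrepancy score (a stand-in for the learned gated module)."""
    gate = 1.0 / (1.0 + np.exp(-(w * discrepancy + b)))  # sigmoid gate
    return gate * text_feat + (1.0 - gate) * visual_feat

# Toy embeddings: a factual caption as the semantic anchor vs. the post text.
caption = np.array([1.0, 0.0, 0.0])
text = np.array([0.0, 1.0, 0.0])

d = cosine_discrepancy(caption, text)  # orthogonal vectors: maximal gap
fused = gated_fusion(text, caption, d)
```

In the sketch, a large caption-text discrepancy pushes the gate toward the textual features, mimicking how an adaptive gate can rebalance modality contributions when the image and text conflict.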