MetaphorVU: Towards Metaphorical Video Understanding

Abstract

Metaphorical videos are prevalent in real-world scenarios as a way to convey complex ideas, and understanding them typically requires higher-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of multimodal large language models (MLLMs) but also impedes a thorough assessment of their higher-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Our experiments show that current MLLMs struggle to understand metaphorical videos accurately and lag far behind human performance, primarily because of defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework that yields consistent performance improvements. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.