MetaphorVU: Towards Metaphorical Video Understanding

Abstract

Metaphorical videos are widely used in real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of multimodal large language models (MLLMs) but also impedes thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Our experiments show that current MLLMs struggle to understand metaphorical videos accurately and lag far behind human performance, primarily because of defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework that achieves consistent performance improvements. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.