MetaphorVU: Towards Metaphorical Video Understanding

Abstract

Metaphorical videos are prevalent across real-world scenarios as a means of conveying complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of multimodal large language models (MLLMs) but also impedes a thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Our experiments show that current MLLMs struggle with accurate metaphorical video understanding and lag far behind human performance, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework that achieves consistent performance improvements. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.