MetaphorVU: Towards Metaphorical Video Understanding

Abstract

Metaphorical videos are widely used across real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of multimodal large language models (MLLMs) but also impedes thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Our experiments show that current MLLMs struggle to understand metaphorical videos accurately and lag far behind human-level performance, primarily because of defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph to augment this mapping and propose MetaphorBoost, an inference-time enhancement framework that achieves consistent performance improvements. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.