ViMU: A Benchmark for Video Metaphorical Understanding

Abstract

Once a new medium emerges, it is used for more than transmitting overt content. The information it carries typically operates on two levels: the content directly presented, and the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Video is no exception. Since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. The true meaning of many videos therefore does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of video subtext are humorous, while others carry irony, mockery, or criticism, and these implicit meanings can be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack the ability to systematically understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate how well frontier models understand subtext in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning, ground their interpretations in multimodal evidence, and answer both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before they answer.