ViMU: Benchmarking Video Metaphor Understanding

Abstract

Any new medium, once it emerges, is used for more than transmitting overt content. The information it carries typically operates on two levels: the content that is directly presented, and the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technology became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. The true meaning of many videos therefore does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some video subtext is humorous, while other subtext carries irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate how well frontier models understand subtext in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning, ground their interpretations in multimodal evidence, and answer both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, so that no key evidence is disclosed to a model before it answers.