ViMU: A Benchmark for Video Metaphor Understanding

Abstract

Any new medium, once it emerges, is used for more than transmitting overt content. The information it carries typically operates on two levels: the content directly presented, and the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Video is no exception: since video technology became widespread, it has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. The true meaning of many videos therefore does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some video subtext is humorous, while other subtext carries irony, mockery, or criticism, and these implicit meanings can be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate how well frontier models understand subtext in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence, using both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, so that no key evidence is disclosed to the model before it answers.