AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking
January 25, 2026
Authors: Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, Yicong Chen, Arsalan Firoozi, Gavin Mischler, Sukru Samet Dindar, Richard Antonello, Linyang He, Tsun-An Hsieh, Xulin Fan, Yulun Wu, Yuesheng Ma, Chaitanya Amballa, Weixiong Chen, Jiarui Hai, Ruisi Li, Vishal Choudhari, Cong Han, Yinghao Aaron Li, Adeen Flinker, Mounya Elhilali, Emmanouil Benetos, Mark Hasegawa-Johnson, Romit Roy Choudhury, Nima Mesgarani
cs.AI
Abstract
Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
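To make the benchmark's structure concrete, the sketch below shows what a single AVMeme Exam record might look like, based only on the fields named in the abstract (modality, original year, transcript, summary, sensitivity, and a paired Q&A). All field names, types, and the level labels are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical record layout for one AVMeme Exam item.
# Field names and types are assumptions for illustration only;
# consult the project page for the actual dataset format.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemeQA:
    question: str        # question probing one level of understanding
    choices: list[str]   # candidate answers (multiple-choice format assumed)
    answer: str          # ground-truth answer
    level: str           # e.g. "surface", "context/emotion", "usage/world knowledge"


@dataclass
class AVMemeRecord:
    clip_path: str             # path to the audio-visual clip
    modality: str              # "speech" | "song" | "music" | "sound effect"
    original_year: int         # year the meme originated
    transcript: Optional[str]  # None for textless music and sound effects
    summary: str               # short content summary
    sensitivity: str           # sensitivity label
    qa: list[MemeQA] = field(default_factory=list)
```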