AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking
January 25, 2026
Authors: Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, Yicong Chen, Arsalan Firoozi, Gavin Mischler, Sukru Samet Dindar, Richard Antonello, Linyang He, Tsun-An Hsieh, Xulin Fan, Yulun Wu, Yuesheng Ma, Chaitanya Amballa, Weixiong Chen, Jiarui Hai, Ruisi Li, Vishal Choudhari, Cong Han, Yinghao Aaron Li, Adeen Flinker, Mounya Elhilali, Emmanouil Benetos, Mark Hasegawa-Johnson, Romit Roy Choudhury, Nima Mesgarani
cs.AI
Abstract
Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content, to context and emotion, to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants on this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared with understanding surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
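As a rough illustration of the record structure the abstract describes (one Q&A per meme plus year, transcript, summary, and sensitivity metadata), the following is a minimal, hypothetical sketch in Python; the class name AVMemeEntry, every field name, and the example values are assumptions for illustration only, not the released schema.

```python
# Hypothetical sketch of one benchmark record; field names are illustrative
# assumptions, not the released AVMeme Exam schema.
from dataclasses import dataclass

@dataclass
class AVMemeEntry:
    clip_path: str    # path to the audio-visual clip
    modality: str     # "speech", "song", "music", or "sound effect"
    question: str     # the meme-specific question
    answer: str       # the reference answer
    level: str        # understanding level: surface content, context/emotion,
                      # usage, or world knowledge
    year: int         # original year of the meme
    transcript: str   # transcript of any spoken or sung content
    summary: str      # short description of the clip
    sensitivity: str  # sensitivity label

# Made-up example showing how such a record might look.
example = AVMemeEntry(
    clip_path="clips/0001.mp4",
    modality="sound effect",
    question="In what kind of online conversation is this sound typically used?",
    answer="To punctuate an awkward silence after a failed joke.",
    level="usage",
    year=2015,
    transcript="",
    summary="A short cricket-chirp clip played after a joke falls flat.",
    sensitivity="low",
)
```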