AVMeme Exam: A Multimodal, Multilingual, and Multicultural Benchmark for the Contextual and Cultural Knowledge and Reasoning of LLMs

Abstract

Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A that assesses levels of understanding, from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and they struggle more with contextual and cultural reasoning than with surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public