SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Abstract

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while the ability of these models to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark that systematically evaluates MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains, with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed-source and open-source models reveal limitations. While the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best-performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research:

Project page: https://vectorinstitute.github.io/sonic-o1/
Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
GitHub: https://github.com/vectorinstitute/sonic-o1
Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
