SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogues. To address this gap, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity along three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni comprises 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that test model robustness. We benchmark 12 leading OLMs and find significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. Encouragingly, the diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
