Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.
