OmniCap-IF: Benchmarking and Improving Instruction-Following Capability for Omni-Video Captioning

Abstract

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks focus primarily on holistic video understanding or text-only instruction following, and fail to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across visual-only, audio-only, and audio-visual modalities, and integrates Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff": increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.