OmniCap-IF: Benchmarking and Improving Instruction Following in Omni Video Captioning

Abstract

Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, but their ability to strictly follow complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks focus primarily on holistic video understanding or text-only instruction following, and so fail to capture the interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions along two dimensions: format correctness and content correctness. The benchmark covers 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, and integrates temporal grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Our analysis also uncovers a critical "format-content tradeoff": increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.