OmniCap-IF: Benchmarking and Improving Instruction Following in Omni-modal Video Captioning

Abstract

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff," demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K-sample instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.