IF-VidCap: Can Video Caption Models Follow Instructions?
October 21, 2025
作者: Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu
cs.AI
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated
proficiency in video captioning, practical applications require captions that
follow specific user instructions rather than exhaustive, unconstrained
descriptions. Current benchmarks, however, primarily assess
descriptive comprehensiveness while largely overlooking instruction-following
capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for
evaluating controllable video captioning, which contains 1,400 high-quality
samples. Distinct from existing video captioning or general
instruction-following benchmarks, IF-VidCap incorporates a systematic framework
that assesses captions on two dimensions: format correctness and content
correctness. Our comprehensive evaluation of over 20 prominent models reveals a
nuanced landscape: despite the continued dominance of proprietary models, the
performance gap is closing, with top-tier open-source solutions now achieving
near-parity. Furthermore, we find that models specialized for dense captioning
underperform general-purpose MLLMs on complex instructions, indicating that
future work should simultaneously advance both descriptive richness and
instruction-following fidelity.
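
The abstract separates evaluation into format correctness and content correctness. As a rough illustration only, the sketch below shows one plausible way such a split could be implemented: rule-based checks for instruction-specified format constraints, and a pluggable judge for content fidelity. The constraint names, the `judge_content` interface, and the example instruction are assumptions for illustration, not the benchmark's published pipeline.

```python
# Illustrative sketch only: IF-VidCap's actual evaluation protocol is not
# reproduced here. It separates the two dimensions named in the abstract:
# format correctness (rule-checkable constraints from the user instruction)
# and content correctness (judged against the video).
import json
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FormatConstraint:
    """A single rule-checkable requirement taken from the user instruction."""
    name: str
    check: Callable[[str], bool]


def check_format(caption: str, constraints: List[FormatConstraint]) -> Dict[str, bool]:
    """Return a pass/fail result for every format constraint."""
    return {c.name: c.check(caption) for c in constraints}


def is_valid_json(caption: str) -> bool:
    try:
        json.loads(caption)
        return True
    except json.JSONDecodeError:
        return False


def score_caption(caption: str,
                  constraints: List[FormatConstraint],
                  judge_content: Callable[[str], float]) -> Dict[str, float]:
    """Combine both dimensions into a simple report.

    `judge_content` is a hypothetical stand-in for a model- or human-based
    check that the described events actually occur in the video.
    """
    fmt = check_format(caption, constraints)
    return {
        "format_correctness": sum(fmt.values()) / max(len(fmt), 1),
        "content_correctness": judge_content(caption),
    }


if __name__ == "__main__":
    # Hypothetical instruction: "Describe the clip as JSON with a 'summary'
    # field of at most 30 words."
    constraints = [
        FormatConstraint("valid_json", is_valid_json),
        FormatConstraint(
            "summary_max_30_words",
            lambda c: is_valid_json(c)
            and len(re.findall(r"\w+", json.loads(c).get("summary", ""))) <= 30,
        ),
    ]
    caption = '{"summary": "A chef slices vegetables, then plates the dish."}'
    # Placeholder judge: a real pipeline would query an LLM or human rater.
    print(score_caption(caption, constraints, judge_content=lambda c: 1.0))
```

A design note on this sketch: keeping the format checks deterministic makes that score reproducible, while isolating content judgment behind a single callable makes it easy to swap in whatever judge a benchmark actually prescribes.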