IF-VidCap: Can Video Caption Models Follow Instructions?
October 21, 2025
Authors: Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu
cs.AI
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated
proficiency in video captioning, practical applications require captions that
follow specific user instructions rather than generating exhaustive,
unconstrained descriptions. Current benchmarks, however, primarily assess
descriptive comprehensiveness while largely overlooking instruction-following
capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for
evaluating controllable video captioning, which contains 1,400 high-quality
samples. Distinct from existing video captioning or general
instruction-following benchmarks, IF-VidCap incorporates a systematic framework
that assesses captions on two dimensions: format correctness and content
correctness. Our comprehensive evaluation of over 20 prominent models reveals a
nuanced landscape: despite the continued dominance of proprietary models, the
performance gap is closing, with top-tier open-source solutions now achieving
near-parity. Furthermore, we find that models specialized for dense captioning
underperform general-purpose MLLMs on complex instructions, indicating that
future work should simultaneously advance both descriptive richness and
instruction-following fidelity.
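The two-dimensional evaluation described above can be sketched in miniature as follows. The specific constraints here (a JSON output format, a word limit, a required keyword set) are hypothetical illustrations chosen for this sketch, not IF-VidCap's actual rubric; the point is only the split between checking *format correctness* (does the caption obey the instruction's structural constraints?) and *content correctness* (does it describe the video accurately?).

```python
import json

def evaluate_caption(caption: str, max_words: int = 50) -> dict:
    """Score a caption on two separate dimensions: format and content."""
    # Format correctness: does the output satisfy the structural
    # constraints from the instruction? (Hypothetical example: valid
    # JSON with a "caption" field whose text is under a word limit.)
    try:
        obj = json.loads(caption)
        text = obj["caption"]
        format_ok = len(text.split()) <= max_words
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        format_ok = False
        text = ""
    # Content correctness: judged against the video itself. Stubbed
    # here as keyword coverage purely for illustration; a real
    # benchmark would use far richer checks.
    required = {"dog", "park"}
    words = set(text.lower().split())
    content_score = len(required & words) / len(required) if text else 0.0
    return {"format_correct": format_ok, "content_score": content_score}

good = '{"caption": "A dog runs through the park at sunset."}'
bad = "A dog runs through the park at sunset."
print(evaluate_caption(good))  # format passes, full keyword coverage
print(evaluate_caption(bad))   # plain text fails the JSON format check
```

Scoring the two dimensions independently matters: the second caption is descriptively fine but violates the instruction's format, which is exactly the failure mode a comprehensiveness-only benchmark would never surface.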