IF-VidCap: Can Video Caption Models Follow Instructions?

October 21, 2025
Authors: Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu
cs.AI

Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
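The abstract states only that captions are scored along two axes, format correctness and content correctness; the sketch below is a minimal, hypothetical illustration of that idea, not the authors' evaluation code. The names `FormatRule`, `format_score`, and `content_score`, the example constraints, and the verbatim-matching content check are all assumptions for illustration; a real setup would likely rely on an LLM judge or human raters for content correctness.

```python
# Hypothetical sketch (not the IF-VidCap implementation): scoring a caption on
# two axes -- format correctness (programmatically checkable constraints from
# the instruction) and content correctness (agreement with required facts).
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class FormatRule:
    """A single programmatically checkable constraint from the instruction."""
    name: str
    check: Callable[[str], bool]

def format_score(caption: str, rules: list[FormatRule]) -> float:
    """Fraction of format constraints the caption satisfies."""
    if not rules:
        return 1.0
    return sum(rule.check(caption) for rule in rules) / len(rules)

def content_score(caption: str, required_facts: list[str]) -> float:
    """Toy stand-in for content correctness: fraction of required facts
    mentioned verbatim. A real benchmark would use a judge model instead."""
    if not required_facts:
        return 1.0
    return sum(fact.lower() in caption.lower() for fact in required_facts) / len(required_facts)

# Example instruction: "Describe the clip in exactly three numbered bullet
# points, using at most 60 words."
rules = [
    FormatRule("three_bullets", lambda c: len(re.findall(r"^\d+\.", c, re.M)) == 3),
    FormatRule("under_60_words", lambda c: len(c.split()) <= 60),
]
caption = "1. A dog runs on the beach.\n2. It catches a red frisbee.\n3. The owner claps."
print(format_score(caption, rules))                  # 1.0
print(content_score(caption, ["frisbee", "beach"]))  # 1.0
```

The point of separating the two scores is that a caption can be fluent and factually rich yet still violate the requested structure, or follow the format perfectly while omitting required content; reporting both dimensions keeps these failure modes distinguishable.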