Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
March 12, 2025
Authors: Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
cs.AI
Abstract
Video Detailed Captioning (VDC) is a crucial task for vision-language
bridging, enabling fine-grained descriptions of complex video content. In this
paper, we first comprehensively benchmark current state-of-the-art approaches
and systematically identify two critical limitations: biased capability
towards specific captioning aspects and misalignment with human preferences.
To address these deficiencies, we propose Cockatiel, a novel three-stage
training pipeline that ensembles synthetic and human-aligned training to
improve VDC performance. In the first stage, we derive a scorer from a
meticulously annotated dataset to select synthetic captions that perform well
on certain fine-grained video-caption alignment dimensions and match human
preferences, while discarding the rest. We then train Cockatiel-13B on this
curated dataset to infuse it with the assembled model strengths and human
preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of
use. Extensive quantitative and qualitative experiments demonstrate the
effectiveness of our method: we not only set new state-of-the-art performance
on VDCSCORE in a dimension-balanced way but also surpass leading alternatives
in human preference by a large margin, as shown by human evaluation results.
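The stage-1 selection described above (a scorer filters candidate captions from several models on fine-grained alignment dimensions plus human preference) can be sketched minimally as follows. This is an illustrative assumption, not the authors' actual implementation: the function names (`score_caption`, `select_best_caption`), the dimension names, the weighted-sum scoring, and the threshold rule are all hypothetical.

```python
# Hypothetical sketch of scorer-based caption selection: each captioner model
# proposes a caption for a video, a scorer rates it per dimension, and only
# the highest-scoring caption is kept if it clears a quality threshold.

def score_caption(weights, dim_scores):
    """Weighted sum over per-dimension scores (e.g. object, action,
    camera) plus a human-preference score. Dimension names are illustrative."""
    return sum(weights[d] * dim_scores[d] for d in weights)

def select_best_caption(candidates, weights, threshold):
    """Keep the highest-scoring candidate caption for this video,
    or discard all candidates if none clears the threshold."""
    scored = [(score_caption(weights, dims), cap) for cap, dims in candidates]
    best_score, best_caption = max(scored)
    return best_caption if best_score >= threshold else None

# Two candidate captions for the same video, with made-up scorer outputs.
weights = {"object": 0.3, "action": 0.3, "camera": 0.2, "preference": 0.2}
candidates = [
    ("A dog runs across a sunlit park.",
     {"object": 0.9, "action": 0.8, "camera": 0.4, "preference": 0.7}),
    ("A dog.",
     {"object": 0.6, "action": 0.1, "camera": 0.1, "preference": 0.2}),
]
print(select_best_caption(candidates, weights, threshold=0.5))
# → A dog runs across a sunlit park.
```

Applied across the whole synthetic corpus, such a filter yields the curated dataset on which Cockatiel-13B is then trained.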