Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

March 12, 2025
作者: Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
cs.AI

Abstract

Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capabilities toward specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that perform well on specific fine-grained aspects of video-caption alignment and are preferred by humans, while disregarding the rest. We then train Cockatiel-13B on this curated dataset, infusing it with the assembled strengths of multiple models and with human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way, but also surpass leading alternatives in human preference by a large margin, as shown by human evaluation results.
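
For illustration, below is a minimal Python sketch of what the stage-1 selection step could look like: candidate captions from several captioners are scored, and only the best candidate per video that clears a quality threshold is kept for the curated training set. The `CandidateCaption` record, `select_captions` helper, threshold value, and toy scorer are assumptions for exposition only, not the paper's actual implementation; the real scorer is the learned alignment/human-preference model described above.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical record for one candidate caption produced by a captioner ensemble.
@dataclass
class CandidateCaption:
    video_id: str
    source_model: str  # which captioner generated this candidate
    text: str

def select_captions(
    candidates: List[CandidateCaption],
    scorer: Callable[[str, str], float],  # (video_id, caption) -> quality score
    threshold: float = 0.7,               # illustrative cutoff, not from the paper
) -> List[CandidateCaption]:
    """Keep only candidates whose score clears the threshold; for each video,
    retain the single best-scoring caption so the curated set combines the
    strengths of different captioners."""
    best = {}  # video_id -> (score, candidate)
    for cand in candidates:
        score = scorer(cand.video_id, cand.text)
        if score < threshold:
            continue  # discard captions that are poorly aligned or dispreferred
        if cand.video_id not in best or score > best[cand.video_id][0]:
            best[cand.video_id] = (score, cand)
    return [cand for _, cand in best.values()]

# Toy usage with a stand-in scorer; a real scorer would be the learned model
# trained on the meticulously annotated preference dataset.
if __name__ == "__main__":
    toy_scorer = lambda vid, text: min(1.0, len(text) / 100)  # placeholder heuristic
    pool = [
        CandidateCaption("v1", "model_a", "A short clip."),
        CandidateCaption("v1", "model_b", "A detailed caption describing the scene, "
                                          "the camera motion, and the main objects."),
    ]
    print(select_captions(pool, toy_scorer))
```

The design intent reflected here is that filtering happens per video and per fine-grained aspect, so the resulting dataset ensembles whichever captioner happens to be strongest for a given clip rather than committing to a single model's outputs.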
