Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

March 12, 2025
作者: Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
cs.AI

Abstract

Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capabilities toward specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that perform well on specific fine-grained aspects of video-caption alignment and are preferred by humans, while disregarding the rest. We then train Cockatiel-13B on this curated dataset, infusing it with the assembled strengths of multiple models and with human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way, but also surpass leading alternatives in human preference by a large margin, as shown by human evaluation results.
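
For illustration, below is a minimal Python sketch of what the stage-1 selection step could look like: candidate captions from several captioners are scored, and only the best candidate per video that clears a quality threshold is kept for the curated training set. The `CandidateCaption` record, `select_captions` helper, threshold value, and toy scorer are assumptions for exposition only, not the paper's actual implementation; the real scorer is the learned alignment/human-preference model described above.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical record for one candidate caption produced by a captioner ensemble.
@dataclass
class CandidateCaption:
    video_id: str
    source_model: str  # which captioner generated this candidate
    text: str

def select_captions(
    candidates: List[CandidateCaption],
    scorer: Callable[[str, str], float],  # (video_id, caption) -> quality score
    threshold: float = 0.7,               # illustrative cutoff, not from the paper
) -> List[CandidateCaption]:
    """Keep only candidates whose score clears the threshold; for each video,
    retain the single best-scoring caption so the curated set combines the
    strengths of different captioners."""
    best = {}  # video_id -> (score, candidate)
    for cand in candidates:
        score = scorer(cand.video_id, cand.text)
        if score < threshold:
            continue  # discard captions that are poorly aligned or dispreferred
        if cand.video_id not in best or score > best[cand.video_id][0]:
            best[cand.video_id] = (score, cand)
    return [cand for _, cand in best.values()]

# Toy usage with a stand-in scorer; a real scorer would be the learned model
# trained on the meticulously annotated preference dataset.
if __name__ == "__main__":
    toy_scorer = lambda vid, text: min(1.0, len(text) / 100)  # placeholder heuristic
    pool = [
        CandidateCaption("v1", "model_a", "A short clip."),
        CandidateCaption("v1", "model_b", "A detailed caption describing the scene, "
                                          "the camera motion, and the main objects."),
    ]
    print(select_captions(pool, toy_scorer))
```

The design intent reflected here is that filtering happens per video and per fine-grained aspect, so the resulting dataset ensembles whichever captioner happens to be strongest for a given clip rather than committing to a single model's outputs.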
