Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
March 12, 2025
Authors: Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
cs.AI
Abstract
Video Detailed Captioning (VDC) is a crucial task for vision-language
bridging, enabling fine-grained descriptions of complex video content. In this
paper, we first comprehensively benchmark current state-of-the-art approaches
and systematically identify two critical limitations: biased capabilities
toward specific captioning aspects and misalignment with human preferences. To
address these deficiencies, we propose Cockatiel, a novel three-stage training
pipeline that ensembles synthetic and human-aligned training to improve VDC
performance. In the first stage, we derive a scorer from a meticulously
annotated dataset and use it to select synthetic captions that perform well on
specific fine-grained video-caption alignment dimensions and on human
preference, while disregarding the rest. We then train Cockatiel-13B on this
curated dataset to infuse it with the ensembled model strengths and human
preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of
use. Extensive quantitative and qualitative experiments demonstrate the
effectiveness of our method: we not only set new state-of-the-art performance
on VDCSCORE in a dimension-balanced way, but also surpass leading alternatives
on human preference by a large margin, as shown by human evaluation results.