Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

Abstract

Video Detailed Captioning (VDC) is a crucial task for bridging vision and language, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: capability biased toward specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that perform well on fine-grained video-caption alignment and are preferred by humans, discarding the rest. Then, we train Cockatiel-13B on this curated dataset to infuse it with the assembled model strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set a new state of the art on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.
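
The first-stage selection step can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that a scorer is used to keep some synthetic captions and disregard others. The names `Candidate`, `select_captions`, and `toy_scorer`, the per-video "keep the best caption" rule, and the `threshold` parameter are all assumptions made for illustration, and the toy scorer (word count) merely stands in for a learned scorer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Candidate:
    """One synthetic caption for one video (hypothetical structure)."""
    video_id: str
    caption: str
    source_model: str  # which captioner produced it (assumed field)


def select_captions(
    candidates: Iterable[Candidate],
    scorer: Callable[[Candidate], float],
    threshold: float,
) -> Dict[str, Candidate]:
    """Keep, per video, the highest-scoring caption that clears `threshold`.

    Both the threshold and the per-video argmax are assumptions; the
    source only says high-scoring captions are selected.
    """
    best: Dict[str, tuple] = {}
    for cand in candidates:
        score = scorer(cand)
        if score < threshold:
            continue  # discard captions the scorer rates too low
        current = best.get(cand.video_id)
        if current is None or score > current[0]:
            best[cand.video_id] = (score, cand)
    return {vid: cand for vid, (_, cand) in best.items()}


def toy_scorer(cand: Candidate) -> float:
    """Placeholder for a learned scorer: counts words in the caption."""
    return float(len(cand.caption.split()))


pool = [
    Candidate("v1", "a cat", "model_a"),
    Candidate("v1", "a cat sits on a mat", "model_b"),
    Candidate("v2", "dog", "model_a"),
]
curated = select_captions(pool, toy_scorer, threshold=3.0)
```

In this toy run only the longer caption for `v1` survives; `v2` has no caption above the threshold and is dropped from the curated set. A real pipeline would replace `toy_scorer` with a model trained on human-annotated preference data.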