Cockatiel: Ensembling Synthetic Data and Human-Preference Training for Detailed Video Captioning

Abstract

Video Detailed Captioning (VDC) is a crucial task for bridging vision and language, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: capability biased towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that perform well on specific fine-grained video-caption alignment and are human-preferred, while discarding the rest. Then, we train Cockatiel-13B on this curated dataset to infuse it with the assembled model strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set a new state of the art on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.
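The first-stage selection described above can be illustrated with a minimal sketch. The abstract does not specify how the scorer is applied, so everything here is an assumption made for illustration: the `Candidate` fields, the `select_captions` helper, the fixed score threshold, and the rule of keeping only the best-scoring caption per video are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One synthetic caption for one video (hypothetical record layout)."""
    video_id: str
    caption: str
    source_model: str  # which captioner produced the caption (assumed field)


def select_captions(
    candidates: Iterable[Candidate],
    scorer: Callable[[Candidate], float],
    threshold: float,
) -> List[Candidate]:
    """Keep, per video, the highest-scoring caption that clears `threshold`.

    Both the threshold and the one-caption-per-video rule are assumptions;
    the source only says that high-scoring captions are kept and the rest
    are discarded.
    """
    best: Dict[str, Tuple[float, Candidate]] = {}
    for cand in candidates:
        score = scorer(cand)
        if score < threshold:
            continue  # discard captions the scorer rates too low
        current = best.get(cand.video_id)
        if current is None or score > current[0]:
            best[cand.video_id] = (score, cand)
    return [cand for _, cand in best.values()]


# Toy usage with a lookup table standing in for a learned scorer.
toy_scores = {"a": 0.9, "b": 0.7, "c": 0.2}
pool = [
    Candidate("v1", "a", "model_1"),
    Candidate("v1", "b", "model_2"),
    Candidate("v2", "c", "model_1"),
]
curated = select_captions(pool, lambda c: toy_scores[c.caption], threshold=0.5)
```

In this toy run, video `v2` contributes nothing to the curated set because its only candidate falls below the threshold; a real pipeline might instead rank candidates or use separate criteria for alignment and human preference.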