AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
October 12, 2025
Authors: Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan
cs.AI
Abstract
Audiovisual video captioning aims to generate semantically rich descriptions
with temporal alignment between visual and auditory events, thereby benefiting
both video understanding and generation. In this paper, we present AVoCaDO, a
powerful audiovisual video captioner driven by the temporal orchestration
between audio and visual modalities. We propose a two-stage post-training
pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated
dataset of 107K high-quality, temporally aligned audiovisual captions; and (2)
AVoCaDO GRPO, which leverages tailored reward functions to further enhance
temporal coherence and dialogue accuracy while regularizing caption length and
reducing collapse. Experimental results demonstrate that AVoCaDO significantly
outperforms existing open-source models across four audiovisual video
captioning benchmarks, and also achieves competitive performance on the VDC and
DREAM-1K benchmarks under visual-only settings.
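
The abstract does not specify how the GRPO rewards are defined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It shows a group-relative advantage (each sampled caption's reward normalized against the other captions sampled for the same video) together with a hypothetical length regularizer. The function names, the word-count target, the tolerance, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def length_penalty(num_words, target=200, tolerance=100):
    """Hypothetical length regularizer (not from the paper).

    Returns 1.0 while the caption length is within target +/- tolerance,
    then decays linearly, reaching 0.0 once the excess equals `target`.
    """
    excess = max(0, abs(num_words - target) - tolerance)
    return max(0.0, 1.0 - excess / target)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled captions.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std here is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical captions sampled for the same video:
    # (task reward in [0, 1], caption length in words).
    samples = [(0.8, 180), (0.6, 420), (0.9, 650)]
    rewards = [task * length_penalty(n) for task, n in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch an overlong caption loses reward even when its task score is high, which is one simple way a reward could discourage length blow-up; the actual reward design in AVoCaDO GRPO may differ.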