AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
October 12, 2025
Authors: Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan
cs.AI
Abstract
Audiovisual video captioning aims to generate semantically rich descriptions
with temporal alignment between visual and auditory events, thereby benefiting
both video understanding and generation. In this paper, we present AVoCaDO, a
powerful audiovisual video captioner driven by the temporal orchestration
between audio and visual modalities. We propose a two-stage post-training
pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated
dataset of 107K high-quality, temporally aligned audiovisual captions; and (2)
AVoCaDO GRPO, which leverages tailored reward functions to further enhance
temporal coherence and dialogue accuracy while regularizing caption length and
reducing collapse. Experimental results demonstrate that AVoCaDO significantly
outperforms existing open-source models across four audiovisual video
captioning benchmarks, and also achieves competitive performance on the VDC and
DREAM-1K benchmarks under visual-only settings.