LLM-AD: Large Language Model based Audio Description System
May 2, 2024
Authors: Peng Chu, Jiang Wang, Andre Abrantes
cs.AI
Abstract
The development of Audio Description (AD) has been a pivotal step forward in
making video content more accessible and inclusive. Traditionally, AD
production has demanded a considerable amount of skilled labor, while existing
automated approaches still necessitate extensive training to integrate
multimodal inputs and tailor the output from a captioning style to an AD style.
In this paper, we introduce an automated AD generation pipeline that harnesses
the potent multimodal and instruction-following capacities of GPT-4V(ision).
Notably, our methodology employs readily available components, eliminating the
need for additional training. It produces ADs that not only comply with
established natural language AD production standards but also maintain
contextually consistent character information across frames, courtesy of a
tracking-based character recognition module. A thorough analysis on the MAD
dataset reveals that our approach achieves a performance on par with
learning-based methods in automated AD production, as substantiated by a CIDEr
score of 20.5.