LLM-AD: Large Language Model based Audio Description System
May 2, 2024
Authors: Peng Chu, Jiang Wang, Andre Abrantes
cs.AI
Abstract
The development of Audio Description (AD) has been a pivotal step forward in
making video content more accessible and inclusive. Traditionally, AD
production has demanded a considerable amount of skilled labor, while existing
automated approaches still necessitate extensive training to integrate
multimodal inputs and tailor the output from a captioning style to an AD style.
In this paper, we introduce an automated AD generation pipeline that harnesses
the potent multimodal and instruction-following capacities of GPT-4V(ision).
Notably, our methodology employs readily available components, eliminating the
need for additional training. It produces ADs that not only comply with
established natural language AD production standards but also maintain
contextually consistent character information across frames, courtesy of a
tracking-based character recognition module. A thorough analysis on the MAD
dataset shows that our approach performs on par with learning-based methods
in automated AD production, achieving a CIDEr score of 20.5.