LLM-AD: Large Language Model based Audio Description System
May 2, 2024
Authors: Peng Chu, Jiang Wang, Andre Abrantes
cs.AI
Abstract
The development of Audio Description (AD) has been a pivotal step forward in
making video content more accessible and inclusive. Traditionally, AD
production has demanded a considerable amount of skilled labor, while existing
automated approaches still necessitate extensive training to integrate
multimodal inputs and tailor the output from a captioning style to an AD style.
In this paper, we introduce an automated AD generation pipeline that harnesses
the potent multimodal and instruction-following capacities of GPT-4V(ision).
Notably, our methodology employs readily available components, eliminating the
need for additional training. It produces ADs that not only comply with
established natural language AD production standards but also maintain
contextually consistent character information across frames, courtesy of a
tracking-based character recognition module. A thorough analysis on the MAD
dataset reveals that our approach achieves a performance on par with
learning-based methods in automated AD production, as substantiated by a CIDEr
score of 20.5.
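
The abstract does not say how the tracking-based character recognition module works. As a rough illustration of the general idea only, the sketch below assumes each detected person carries a track ID and that only some frames yield a recognized name. It then spreads the most frequent name along each track so every frame refers to the character consistently. The function names, the dictionary fields (`frame`, `track_id`, `name`), and the majority-vote rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def propagate_names(detections):
    """Give every detection in a track the track's most frequent known name.

    `detections` is a list of dicts with the hypothetical keys
    'frame' (int), 'track_id' (int) and 'name' (str or None).
    """
    votes = defaultdict(Counter)
    for det in detections:
        if det["name"] is not None:
            votes[det["track_id"]][det["name"]] += 1

    # Majority vote per track (an assumption, not the paper's stated rule).
    track_names = {tid: c.most_common(1)[0][0] for tid, c in votes.items()}

    return [
        {**det, "name": track_names.get(det["track_id"], det["name"])}
        for det in detections
    ]


def frame_characters(detections):
    """Return {frame: "name, name, ..."} text that could accompany each frame."""
    by_frame = defaultdict(list)
    for det in propagate_names(detections):
        by_frame[det["frame"]].append(det["name"] or "unknown person")
    return {frame: ", ".join(names) for frame, names in sorted(by_frame.items())}


if __name__ == "__main__":
    example = [
        {"frame": 0, "track_id": 1, "name": "Anna"},
        {"frame": 1, "track_id": 1, "name": None},  # face not recognized here
        {"frame": 1, "track_id": 2, "name": None},  # never recognized
        {"frame": 2, "track_id": 1, "name": "Anna"},
    ]
    for frame, names in frame_characters(example).items():
        print(frame, names)
```

In a pipeline of the kind the abstract describes, per-frame text like this could be supplied alongside the frames in the prompt to the multimodal model, so the generated AD names the same character the same way across frames.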