

CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

September 19, 2024
作者: Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
cs.AI

Abstract

The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, among them, auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. While current methods focus on specific aspects, they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking LLMs for a semantic distance score. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics, with a 5.8% relative accuracy improvement compared to the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by allowing the language model to explain the reasoning behind its scores, with these explanations rated up to 30% better by human evaluators than those provided by baseline methods. CLAIR-A is made publicly available at https://github.com/DavidMChan/clair-a.
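The core of the approach can be sketched as a single LLM call: build a prompt that asks for a semantic distance score between a candidate caption and the reference captions, then parse a structured reply containing both the score and the model's reasoning. The prompt wording and JSON schema below are illustrative assumptions, not the exact ones used by CLAIR-A (those live in the linked repository), and `call_llm` is a hypothetical stand-in for any chat-completion client.

```python
import json


def build_prompt(candidate: str, references: list[str]) -> str:
    # Illustrative prompt in the spirit of CLAIR-A; the authors' exact
    # wording is in the public repository, not reproduced here.
    refs = "\n".join(f"- {r}" for r in references)
    return (
        "You are evaluating a machine-generated audio caption.\n"
        f"Candidate caption: {candidate}\n"
        f"Reference captions:\n{refs}\n"
        "On a scale of 0 to 100, how semantically close is the candidate "
        "to the references? Reply with JSON only: "
        '{"score": <int>, "reason": "<one sentence>"}'
    )


def parse_response(text: str) -> tuple[float, str]:
    # Parse the LLM's JSON reply into a normalized [0, 1] score and a
    # human-readable justification (the transparency the paper highlights).
    data = json.loads(text)
    return data["score"] / 100.0, data["reason"]


def clair_a_score(candidate: str, references: list[str], call_llm) -> tuple[float, str]:
    # call_llm: any function mapping a prompt string to the model's text
    # reply (hypothetical stand-in for a real LLM client).
    return parse_response(call_llm(build_prompt(candidate, references)))
```

Because the score and explanation come back together in one structured reply, the explanation the paper's human evaluators rated falls out of the same call that produces the metric.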


November 16, 2024