CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Abstract

The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires weighing diverse factors, including auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. Current methods each focus on specific aspects, and they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking an LLM for a semantic distance score. In our evaluations, CLAIR-A predicts human judgments of quality better than traditional metrics, with a 5.8% relative accuracy improvement over the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by letting the language model explain the reasoning behind its scores; human evaluators rated these explanations up to 30% better than those provided by baseline methods. CLAIR-A is publicly available at https://github.com/DavidMChan/clair-a.
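To make the idea of "directly asking an LLM for a score" concrete, the following is a minimal sketch of the two steps around the model call: building a prompt from a candidate caption and its references, and parsing a numeric score and explanation from the reply. The prompt wording, the 0 to 100 scale, the JSON reply format, and the function names `build_prompt` and `parse_reply` are illustrative assumptions, not the prompt or API of the released CLAIR-A code. The model call itself is left out.

```python
import json
import re


def build_prompt(candidate: str, references: list[str]) -> str:
    """Assemble a judging prompt (hypothetical wording, not the CLAIR-A prompt)."""
    refs = "\n".join(f"- {r}" for r in references)
    return (
        "You are judging an audio caption.\n"
        f"Candidate caption:\n- {candidate}\n"
        f"Reference captions:\n{refs}\n"
        "On a scale of 0 to 100, how well does the candidate describe "
        "the same sounds as the references? "
        'Reply with JSON: {"score": <int>, "reason": "<text>"}'
    )


def parse_reply(reply: str) -> tuple[float, str]:
    """Extract a score in [0, 1] and the explanation from a model reply.

    Assumes the reply contains one JSON object with a "score" field;
    this format is an assumption made for the sketch.
    """
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    # Clamp to the assumed 0-100 range, then normalize to [0, 1].
    score = min(max(float(data["score"]), 0.0), 100.0) / 100.0
    return score, str(data.get("reason", ""))
```

In use, the string from `build_prompt` would be sent to whichever LLM is available, and the returned text passed to `parse_reply`. Returning the explanation alongside the score mirrors the transparency property the abstract describes.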
