How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Abstract

Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio, namely automatic speech recognition (ASR) transcripts and back-translations of the reference translation, and introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
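
The choice between the two synthetic sources can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the ASR word error rate can be estimated somewhere, for instance on a small held-out set that does have gold transcripts; the abstract does not say how the rate is obtained. The names `word_error_rate`, `choose_synthetic_source`, and `WER_THRESHOLD` are hypothetical; only the 20% figure comes from the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumption: a hard cut-off at the 20% figure quoted in the abstract.
WER_THRESHOLD = 0.20


def choose_synthetic_source(estimated_wer: float) -> str:
    """Return which textual proxy to feed to a source-aware metric."""
    return "asr" if estimated_wer < WER_THRESHOLD else "back_translation"


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.2f}, use {choose_synthetic_source(wer)}")
```

In this toy case the hypothesis has one substitution and one deletion against six reference words, so the estimated rate is above the threshold and the sketch falls back to back-translation. A real setup would average the rate over many utterances rather than decide from a single sentence.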
