Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

Abstract

In this paper, we present Cross Language Agent -- Simultaneous Interpretation (CLASI), a high-quality and human-like Simultaneous Speech Translation (SiST) system. Inspired by professional human interpreters, we use a novel data-driven read-write strategy to balance translation quality and latency. To address the challenge of translating in-domain terminology, CLASI employs a multi-modal retrieval module that obtains relevant information to augment the translation. Supported by large language models (LLMs), our approach can generate error-tolerant translations by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with the practice of professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of information that is successfully conveyed to the listeners. In real-world scenarios, where speech is often disfluent, informal, and unclear, CLASI achieves a VIP of 81.3% and 78.0% for the Chinese-to-English and English-to-Chinese translation directions, respectively. In contrast, state-of-the-art commercial or open-source systems achieve only 35.4% and 41.6%. On an extremely hard dataset, where other systems achieve under 13% VIP, CLASI still achieves 70% VIP.