Improving Joint Speech-Text Representations Without Alignment

Abstract

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.
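The idea of a consistency loss that forgives length differences and assumes the best alignment can be illustrated with a small sketch. The abstract does not give the loss's exact form, so everything here is an assumption made for illustration: the function name `best_alignment_loss`, the squared-distance cost, the restriction to monotonic alignments, and the requirement that the text sequence be no longer than the speech sequence. It should be read as one plausible reading of the idea, not as the paper's implementation.

```python
import numpy as np


def best_alignment_loss(speech, text):
    """Mean squared distance under the cheapest monotonic alignment.

    speech: (T, d) array of speech-encoder frames (hypothetical input).
    text:   (U, d) array of text-encoder outputs, assumed to satisfy U <= T.
    """
    speech = np.asarray(speech, dtype=float)
    text = np.asarray(text, dtype=float)
    T, U = len(speech), len(text)
    if T == 0 or U == 0:
        raise ValueError("both sequences must be non-empty")
    if U > T:
        raise ValueError("text sequence must not be longer than speech sequence")

    # Pairwise squared distances between every frame and every text position.
    cost = ((speech[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)

    # best[t, u]: cheapest total cost of aligning frames 0..t to positions 0..u,
    # where each new frame stays on the same text position or advances by one.
    best = np.full((T, U), np.inf)
    best[0, 0] = cost[0, 0]
    for t in range(1, T):
        for u in range(min(t + 1, U)):
            stay = best[t - 1, u]
            advance = best[t - 1, u - 1] if u > 0 else np.inf
            best[t, u] = cost[t, u] + min(stay, advance)

    # Normalize by the number of frames so the value does not grow with T.
    return best[T - 1, U - 1] / T
```

No up-sampling heuristic or separate alignment model appears in this sketch: the dynamic program simply takes the minimum over all monotonic alignments. The NumPy version only shows the forward computation; training with such a loss would need the same recurrence written in an automatic-differentiation framework.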
