Augmenting text for spoken language understanding with Large Language Models

Abstract

Spoken semantic parsing (SSP) generates machine-comprehensible parses from input speech. Training robust models for existing application domains, i.e. those represented in the training data, or extending them to new domains requires triplets of speech, transcript, and semantic parse, which are expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data without corresponding speech (unpaired text). First, for the case where unpaired text is drawn from existing textual corpora, we compare Joint Audio Text (JAT) and Text-to-Speech (TTS) as ways to generate speech representations for unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves Exact Match (EM) by 2% and 30% absolute, respectively. Second, we consider the setting where unpaired text is not available in existing textual corpora. We propose prompting Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains, respectively.
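The abstract reports its results as absolute changes in Exact Match (EM). As a rough illustration of what such a metric measures, the sketch below scores a prediction as correct only when the whole predicted parse string equals the reference after whitespace normalization. The function names, the normalization step, and the bracketed parse strings are assumptions made for illustration; the abstract does not specify how the authors compute EM.

```python
from typing import Sequence


def normalize(parse: str) -> str:
    """Collapse runs of whitespace so spacing differences are not counted as errors."""
    return " ".join(parse.split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions identical to their reference parse."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical parse strings, invented for this example.
references = [
    "[IN:GET_WEATHER what is the weather in [SL:LOCATION Boston ] ]",
    "[IN:CREATE_ALARM set an alarm for [SL:DATE_TIME 7 am ] ]",
]
predictions = [
    "[IN:GET_WEATHER what is the weather in  [SL:LOCATION Boston ] ]",  # extra space only
    "[IN:CREATE_ALARM set an alarm for [SL:DATE_TIME 8 am ] ]",         # wrong slot value
]
score = exact_match(predictions, references)
```

Under this definition, an "absolute" improvement of 2% means the score itself rises by two percentage points (for example from 60.0 to 62.0), not by 2% of its previous value.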
