Augmenting text for spoken language understanding with Large Language Models
September 17, 2023
Authors: Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer
cs.AI
Abstract
Spoken semantic parsing (SSP) involves generating machine-comprehensible
parses from input speech. Training robust models for existing application
domains represented in training data or extending to new domains requires
corresponding triplets of speech-transcript-semantic parse data, which is
expensive to obtain. In this paper, we address this challenge by examining
methods that can use transcript-semantic parse data (unpaired text) without
corresponding speech. First, when unpaired text is drawn from existing textual
corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways
to generate speech representations for unpaired text. Experiments on the STOP
dataset show that unpaired text from existing and new domains improves
performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we
consider the setting where unpaired text is not available in existing textual
corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired
text for existing and new domains. Experiments show that examples and words
that co-occur with intents can be used to generate unpaired text with Llama
2.0. Using the generated text with JAT and TTS for spoken semantic parsing
improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains
respectively.
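
The two augmentation routes above lend themselves to short sketches. First, a minimal, hypothetical sketch of the JAT idea, assuming the common formulation in which embeddings of unpaired transcript tokens stand in for speech encoder outputs, so a single downstream parser decoder can train on paired triplets and unpaired text alike; the module and its dimensions (JATEncoder, n_mels, d_model) are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class JATEncoder(nn.Module):
    """Shared encoder that accepts either speech features or text tokens."""
    def __init__(self, n_mels=80, vocab_size=10_000, d_model=256):
        super().__init__()
        self.speech_proj = nn.Linear(n_mels, d_model)        # log-mel -> model dim
        self.text_embed = nn.Embedding(vocab_size, d_model)  # tokens  -> model dim
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, speech=None, text_tokens=None):
        # Paired triplets supply speech features (B, T, n_mels); unpaired
        # transcript-parse pairs supply token ids (B, L) instead.
        x = self.speech_proj(speech) if speech is not None else self.text_embed(text_tokens)
        return self.shared(x)  # (B, T or L, d_model), fed to the parse decoder

enc = JATEncoder()
speech_reps = enc(speech=torch.randn(4, 120, 80))               # paired batch
text_reps = enc(text_tokens=torch.randint(0, 10_000, (4, 12)))  # unpaired batch
```

The TTS route replaces the text-embedding branch by synthesizing audio for the unpaired text and encoding it like real speech, which costs more compute but yields representations closer to the paired data.

Second, a sketch of the prompting recipe the abstract describes: exemplar utterances plus words that co-occur with an intent. The checkpoint name, prompt wording, and the helper make_prompt are assumptions for illustration, not the paper's exact setup (the Llama 2 checkpoints on Hugging Face are gated and require access):

```python
from transformers import pipeline

# Assumed checkpoint; any instruction-tuned Llama 2 variant would do here.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def make_prompt(intent, exemplars, cooccurring_words, n=5):
    """Build a prompt asking for n new utterances carrying one intent."""
    examples = "\n".join(f"- {e}" for e in exemplars)
    words = ", ".join(cooccurring_words)
    return (
        f"Example user requests with intent {intent}:\n{examples}\n"
        f"Words that often occur with this intent: {words}.\n"
        f"Write {n} new, diverse user requests with the same intent:\n"
    )

prompt = make_prompt(
    intent="IN:CREATE_ALARM",  # an intent label in the STOP/TOPv2 style
    exemplars=["set an alarm for 7 am", "wake me up at noon tomorrow"],
    cooccurring_words=["alarm", "wake", "morning", "remind"],
)
out = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.9)
print(out[0]["generated_text"])
```

In the paper's setting, "unpaired text" means transcript-semantic parse pairs without audio, so utterances generated this way would still need parses (a step the sketch leaves out) and speech representations via JAT or TTS before they can train the spoken semantic parser.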