Augmenting text for spoken language understanding with Large Language Models

Abstract

Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data, or extending to new domains, requires corresponding triplets of speech-transcript-semantic parse data, which are expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data without corresponding speech (unpaired text). First, when unpaired text is drawn from existing textual corpora, we compare Joint Audio Text (JAT) and Text-to-Speech (TTS) as ways to generate speech representations for unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves Exact Match (EM) by 2% and 30% absolute, respectively. Second, we consider the setting where unpaired text is not available in existing textual corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains, respectively.
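To make the prompting idea concrete, the following is a minimal sketch of how words that co-occur with an intent, together with a few example utterances, might be collected from paired transcript-parse data and assembled into an LLM prompt. It is not the paper's implementation: the bracketed parse notation, the toy data, the stopword list, the prompt wording, and the helper names `top_intent`, `cooccurring_words`, and `build_prompt` are all assumptions made for illustration, and no actual LLM call is shown.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy data: (transcript, semantic parse) pairs. The bracketed
# parse notation and the utterances are assumptions for illustration only.
PAIRS = [
    ("what is the weather in boston",
     "[IN:GET_WEATHER what is the weather in [SL:LOCATION boston ] ]"),
    ("will it rain tomorrow",
     "[IN:GET_WEATHER will it rain [SL:DATE_TIME tomorrow ] ]"),
    ("set an alarm for six am",
     "[IN:CREATE_ALARM set an alarm [SL:DATE_TIME for six am ] ]"),
]

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"what", "is", "the", "in", "it", "an", "for", "will"}


def top_intent(parse):
    """Return the outermost intent label of a bracketed parse, or None."""
    match = re.match(r"\[IN:([A-Z_]+)", parse)
    return match.group(1) if match else None


def cooccurring_words(pairs, k=3):
    """Map each intent to up to k frequent non-stopword transcript words."""
    counts = defaultdict(Counter)
    for transcript, parse in pairs:
        intent = top_intent(parse)
        if intent is None:
            continue  # skip parses that do not start with an intent
        counts[intent].update(
            word for word in transcript.split() if word not in STOPWORDS
        )
    return {
        intent: [word for word, _ in counter.most_common(k)]
        for intent, counter in counts.items()
    }


def build_prompt(intent, words, examples, n=5):
    """Assemble a plain-text prompt asking an LLM for n new utterances."""
    lines = [
        f"Intent: {intent}",
        f"Words that often appear with this intent: {', '.join(words)}",
        "Example utterances:",
    ]
    lines.extend(f"- {example}" for example in examples)
    lines.append(f"Write {n} new utterances that express the same intent.")
    return "\n".join(lines)
```

For instance, `build_prompt("GET_WEATHER", cooccurring_words(PAIRS)["GET_WEATHER"], ["will it rain tomorrow"])` yields a prompt string that could be sent to an instruction-following model. The generated utterances would still need semantic parses and speech representations (for example via JAT or TTS, as the abstract describes) before they could be used for SSP training.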
