Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data

Abstract

In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of LLM capabilities, without using any carefully curated paired data. The proposed model can take audio prompts in place of text and sustain a conversation. It also has extended cross-modal capabilities, such as speech question answering, speech translation, and audio summarization, among many other closed- and open-domain tasks. This contrasts with prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. Experiments show that our end-to-end approach is on par with or outperforms a cascaded system (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike a cascade, our approach can interchange text and audio modalities and use the prior context in a conversation to provide better results.

