Language Model Can Listen While Speaking

Abstract

Dialogue is the most natural mode of human-computer interaction (HCI). Recent advances in speech language models (SLMs) have significantly improved speech-based conversational AI. However, these models are limited to turn-based conversation and cannot interact with humans in real-time spoken scenarios, for example, by being interrupted when the generated content is unsatisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on real-time interaction and, more specifically, on the essential ability to handle interruption. We introduce a novel model design, the listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. It fuses both channels for autoregressive generation and detects turn-taking in real time. We explore three fusion strategies (early fusion, middle fusion, and late fusion) and find that middle fusion achieves the best balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's ability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems and enhance their applicability in real-world contexts.
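As a rough illustration of the three fusion strategies named in the abstract, the toy sketch below shows one plausible reading of where the listening channel could be merged into the speaking channel of a decoder. The abstract does not specify these merge points, so everything here is an assumption: early fusion is taken to mean summing the channels at the input embeddings, middle fusion adding the listening features at the input of every block, and late fusion summing the two channels' output logits. The names `block`, `fused_logits`, `BLOCK_W`, `OUT_W`, and all sizes are hypothetical, and the "blocks" are simple residual layers rather than real Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, N_BLOCKS = 8, 16, 2  # hypothetical hidden size, vocabulary size, depth

# Toy stand-ins for the decoder blocks and the output projection (assumptions).
BLOCK_W = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_BLOCKS)]
OUT_W = rng.standard_normal((D, V)) * 0.1


def block(x, w):
    """Residual nonlinearity standing in for one decoder block."""
    return x + np.tanh(x @ w)


def fused_logits(speak, listen, fusion):
    """Return (T, V) logits from (T, D) speaking and listening features.

    fusion: "early", "middle", or "late" (assumed merge points, see lead-in).
    """
    if fusion not in ("early", "middle", "late"):
        raise ValueError(f"unknown fusion strategy: {fusion}")

    # Early fusion (assumed): merge once, at the input embeddings.
    x = speak + listen if fusion == "early" else speak

    for w in BLOCK_W:
        if fusion == "middle":
            # Middle fusion (assumed): inject listening features at every block input.
            x = x + listen
        x = block(x, w)

    logits = x @ OUT_W

    if fusion == "late":
        # Late fusion (assumed): run the listening channel separately, sum the logits.
        y = listen
        for w in BLOCK_W:
            y = block(y, w)
        logits = logits + y @ OUT_W

    return logits
```

Calling `fused_logits(speak, listen, "middle")` with two arrays of shape `(T, D)` returns an array of shape `(T, V)`. The sketch only makes the difference between the merge points concrete; it says nothing about which strategy performs best.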
