Language Model Can Listen While Speaking

Abstract

Dialogue is the most natural form of human-computer interaction (HCI). Recent advances in speech language models (SLMs) have significantly improved speech-based conversational AI. However, these models are limited to turn-based conversation and cannot interact with humans in real-time spoken scenarios, for example by being interrupted when the generated content is unsatisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on real-time interaction and, more specifically, on the quintessential ability of interruption. We introduce a novel model design, the listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. It fuses both channels for autoregressive generation and detects turn-taking in real time. We explore three fusion strategies (early fusion, middle fusion, and late fusion), with middle fusion achieving the best balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems and enhance their applicability in real-world contexts.
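
The abstract names early, middle, and late fusion but does not define them. The toy sketch below shows one plausible reading, where the three strategies differ only in where the listening-channel features are added to the speaking channel inside a stack of decoder layers. It is an illustration under stated assumptions, not the authors' implementation: `decode_step`, `add`, `double`, the additive combination, and the equal vector sizes for both channels are all hypothetical simplifications.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def double(v: Vector) -> Vector:
    """Hypothetical stand-in for a decoder layer: scales every feature by 2."""
    return [2.0 * x for x in v]


def decode_step(speak: Vector, listen: Vector, layers: List[Layer], fusion: str) -> Vector:
    """One toy generation step that mixes the listening channel into the speaking channel.

    Assumed fusion points (not taken from the abstract):
      "early":  add the listening features once, at the input of the stack.
      "middle": add the listening features before every layer.
      "late":   add the listening features once, after the last layer.
    """
    if fusion not in ("early", "middle", "late"):
        raise ValueError(f"unknown fusion strategy: {fusion!r}")

    hidden = add(speak, listen) if fusion == "early" else list(speak)
    for layer in layers:
        if fusion == "middle":
            hidden = add(hidden, listen)
        hidden = layer(hidden)
    if fusion == "late":
        hidden = add(hidden, listen)
    return hidden


if __name__ == "__main__":
    speak_features = [1.0, 2.0]    # toy embedding of the last generated speech token
    listen_features = [0.5, 0.5]   # toy features from the streaming audio encoder
    for strategy in ("early", "middle", "late"):
        print(strategy, decode_step(speak_features, listen_features, [double, double], strategy))
```

In this simplified picture, middle fusion lets the listening signal reach every layer, whereas early and late fusion inject it only once, at the input or at the output respectively.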
