Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
March 20, 2026
Authors: Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeongoon Kim, Jaegul Choo, Cheonbok Park
cs.AI
Abstract
As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.