Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

Abstract

As the AI paradigm shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Handling the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, and standard processing pipelines suffer from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.