MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Abstract

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience than traditional half-duplex models. However, existing benchmarks primarily evaluate single-round interactions and neglect the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. In addition, existing benchmarks often focus solely on conversational features, neglecting other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also incorporates multiple evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs struggle to maintain consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench
