C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

July 30, 2025
Authors: Chengqian Ma, Wei Tao, Yiwen Guo
cs.AI

Abstract

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
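As an illustration of what an LLM-based evaluation of this kind can look like, below is a minimal sketch of an LLM-as-judge scoring loop. It is not the paper's actual protocol: the judge model name ("gpt-4o"), the prompt wording, and the binary yes/no verdict are all assumptions made for the example, and the benchmark's real instance format and scoring scale may differ.

```python
# Hypothetical sketch of an LLM-as-judge evaluation loop (not the paper's
# actual protocol): a judge model compares an SDM's transcribed response
# against a reference answer and returns a binary correctness verdict.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a spoken dialogue model.
User query (transcribed): {query}
Reference answer: {reference}
Model response (transcribed): {response}
Does the model response resolve the query's intended meaning (including any
ambiguity or context dependency) consistently with the reference?
Answer only "yes" or "no"."""


def judge(query: str, reference: str, response: str,
          model: str = "gpt-4o") -> bool:
    """Return True if the judge model deems the response correct."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, reference=reference, response=response),
        }],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")


# Accuracy over a (hypothetical) list of benchmark instances:
# accuracy = sum(judge(x["query"], x["reference"], x["response"])
#                for x in data) / len(data)
```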