C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Abstract

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their growing popularity, little research has comprehensively examined how effectively they understand and emulate human conversations, in contrast to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text because of characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors such as polysemy as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Context dependency, including omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset comprising 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of how SDMs perform on these practical challenges.