PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues
January 24, 2026
Authors: Mohammad Rifqi Farhansyah, Hanif Muhammad Zhafran, Farid Adilazuarda, Shamsuddeen Hassan Muhammad, Maryam Ibrahim Mukhtar, Nedjma Ousidhoum, Genta Indra Winata, Ayu Purwarianti, Alham Fikri Aji
cs.AI
Abstract
Code-switching is a widespread practice among the world's multilingual majority, yet few benchmarks accurately reflect its complexity in everyday communication. We present PingPong, a benchmark for natural multi-party code-switching dialogues covering five language-combination variations, some of which are trilingual. Our dataset consists of human-authored conversations among 2 to 4 participants that exhibit authentic, multi-threaded structures in which replies frequently reference much earlier points in the dialogue. We demonstrate that our data is significantly more natural and structurally diverse than machine-generated alternatives, offering greater variation in message length, speaker dominance, and reply distance. Based on these dialogues, we define three downstream tasks: Question Answering, Dialogue Summarization, and Topic Classification. Evaluations of several state-of-the-art language models on PingPong reveal that performance remains limited on code-switched inputs, underscoring the urgent need for more robust NLP systems capable of addressing the intricacies of real-world multilingual discourse.