CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models
May 19, 2025
Authors: Sathya Krishnan Suresh, Tanmay Surana, Lim Zhi Hao, Eng Siong Chng
cs.AI
Abstract
Code-switching (CS) poses a significant challenge for Large Language Models
(LLMs), yet how well LLMs comprehend it remains underexplored. We introduce
CS-Sum, which evaluates LLM comprehension of CS through summarization of CS
dialogues into English. CS-Sum is the first benchmark for CS dialogue
summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and
Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language
pair. Evaluating ten LLMs, including open- and closed-source models, we analyze
performance across few-shot, translate-summarize, and fine-tuning (LoRA and
QLoRA on synthetic data) approaches. Our findings show that although scores on
automated metrics are high, LLMs make subtle mistakes that completely alter the
meaning of the dialogue. To this end, we identify the three most common types
of errors that LLMs make when handling CS input. Error rates vary across CS
pairs and LLMs, with some LLMs showing more frequent errors on certain language
pairs, underscoring the need for specialized training on code-switched data.
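The abstract's point that high automated-metric scores can mask meaning-altering mistakes is easy to see with a surface-overlap metric such as ROUGE-1, which is standard for summarization (the abstract does not name its specific metrics, so this is an illustrative sketch, not the paper's evaluation code):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# A summary that negates the reference still shares almost every word,
# so its ROUGE-1 score stays high despite the reversed meaning.
ref = "the team agreed to move the launch to friday"
bad = "the team refused to move the launch to friday"
print(round(rouge1_f1(bad, ref), 2))
```

One wrong word here flips the dialogue's outcome while leaving the overlap score near 1.0, which is exactly the gap between metric scores and comprehension that the benchmark's error analysis targets.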