CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

Abstract

Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet how well LLMs comprehend it remains underexplored. We introduce CS-Sum to evaluate LLM comprehension of CS through the task of summarizing CS dialogues into English. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To characterize these mistakes, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates vary across CS pairs and LLMs, with some LLMs showing more frequent errors on certain language pairs, underscoring the need for specialized training on code-switched data.
