BiPhone: Modeling Inter Language Phonetic Influences in Text

Abstract

A large number of people are forced to use the Web in a language in which they have low literacy because of technology asymmetries. Text written in the second language (L2) by such users often contains many errors influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that state-of-the-art language understanding models perform poorly on it. We further introduce a new phoneme prediction pre-training task that helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research on phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
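
As a rough illustration of what "plugging phoneme confusions into a generative model" can mean, the toy sketch below samples substitutions over a phoneme sequence. It is not the Bi-Phone model: the confusion table, its probabilities, and the names `TOY_CONFUSIONS` and `corrupt_phonemes` are hypothetical, and the sketch skips the conversion between written words and phonemes entirely.

```python
import random

# Hypothetical confusion table (illustrative values, not mined data):
# each L2 phoneme maps to (substitute phoneme, probability) pairs.
TOY_CONFUSIONS = {
    "v": [("w", 0.4)],
    "z": [("s", 0.3), ("j", 0.2)],
}


def corrupt_phonemes(phonemes, confusions, rng):
    """Return a copy of `phonemes` with some entries swapped for confusable ones."""
    out = []
    for p in phonemes:
        r = rng.random()  # uniform draw in [0, 1)
        cumulative = 0.0
        chosen = p  # default: keep the phoneme unchanged
        for substitute, prob in confusions.get(p, []):
            cumulative += prob
            if r < cumulative:
                chosen = substitute
                break
        out.append(chosen)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same sample
    print(corrupt_phonemes(["v", "e", "r", "i"], TOY_CONFUSIONS, rng))
```

Swapping in a different table for a different L1 changes which corruptions are sampled, which is the L1-dependent behavior the abstract describes at a high level.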
