Bi-Phone: Modeling Inter Language Phonetic Influences in Text

Abstract

Due to technology asymmetries, a large number of people are forced to use the Web in a language in which they have low literacy. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly on it. We further introduce a new phoneme prediction pre-training task that helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
