BiPhone: Modeling Inter Language Phonetic Influences in Text

Abstract

Many people are forced to use the Web in a language in which they have low literacy because of technology asymmetries. Text written in a second language (L2) by such users often contains many errors influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) that synthetically produces corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and that have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique, yielding FunGLUE (for Phonetically Noised GLUE), and show that SoTA language understanding models perform poorly on it. We further introduce a new phoneme prediction pre-training task that helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research on phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
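The idea of plugging confusions into a generative corruption step can be illustrated with a minimal sketch. It is not the Bi-Phone model itself: it works on spelling chunks as a stand-in for phonemes, and the `CONFUSIONS` table, its probabilities, and the function names `corrupt_word` and `corrupt_text` are all hypothetical, chosen only for illustration and not mined from any data.

```python
import random

# Hypothetical confusion table (an assumption, not mined data): each chunk
# maps to a list of (replacement, probability) pairs. Any leftover
# probability mass means "keep the chunk unchanged".
CONFUSIONS = {
    "v": [("w", 0.5)],
    "th": [("d", 0.4), ("t", 0.3)],
    "z": [("j", 0.3)],
}


def corrupt_word(word, confusions, rng):
    """Scan left to right and sample a replacement for each known chunk."""
    out = []
    i = 0
    # Try longer chunks first so that "th" is matched before single letters.
    chunk_lengths = sorted({len(k) for k in confusions}, reverse=True)
    while i < len(word):
        for n in chunk_lengths:
            chunk = word[i:i + n]
            if len(chunk) == n and chunk in confusions:
                r = rng.random()
                cumulative = 0.0
                replacement = chunk  # default: keep the original chunk
                for candidate, prob in confusions[chunk]:
                    cumulative += prob
                    if r < cumulative:
                        replacement = candidate
                        break
                out.append(replacement)
                i += n
                break
        else:
            # No chunk in the table starts here: copy the character as-is.
            out.append(word[i])
            i += 1
    return "".join(out)


def corrupt_text(text, confusions, seed=0):
    """Corrupt whitespace-separated words with a seeded random generator."""
    rng = random.Random(seed)
    return " ".join(corrupt_word(w, confusions, rng) for w in text.split())
```

Calling `corrupt_text("this is very easy", CONFUSIONS, seed=1)` returns a noised variant of the sentence, and the same seed always gives the same result. In this sketch a different L1 would simply be modeled by a different confusion table.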
