BiPhone: Modeling Inter Language Phonetic Influences in Text

Abstract

Because of technology asymmetries, a large number of people are forced to use the Web in a language in which they have low literacy. Text written in the second language (L2) by such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that state-of-the-art (SoTA) language understanding models perform poorly on it. We further introduce a new phoneme prediction pre-training task that helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
