Bi-Phone: Modeling Inter Language Phonetic Influences in Text

Abstract

Technology asymmetries force many people to use the Web in a language in which they have low literacy. Text written in a second language (L2) by such users often contains many errors influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) to synthetically produce corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly on it. We further introduce a new phoneme prediction pre-training task that helps byte models recover performance close to that on the original SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research on phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
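The idea of plugging confusions into a generator can be illustrated with a toy sketch. This is not the Bi-Phone model: the abstract does not specify its details, so the confusion table below is hand-written rather than mined, it operates on letters as a stand-in for phonemes, and the names `TOY_CONFUSIONS`, `corrupt`, and the probability `p` are hypothetical.

```python
import random

# Hypothetical, hand-written confusion table (NOT mined from data).
# Letters stand in for phonemes purely for illustration.
TOY_CONFUSIONS = {
    "v": ["w"],
    "w": ["v"],
    "z": ["j"],
}


def corrupt(word, confusions, rng, p=0.5):
    """Replace each confusable symbol with one of its alternatives with probability p."""
    out = []
    for ch in word:
        alternatives = confusions.get(ch)
        # rng.random() is in [0, 1), so p=1.0 always substitutes and p=0.0 never does.
        if alternatives and rng.random() < p:
            out.append(rng.choice(alternatives))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for word in ["wave", "zero", "very"]:
        print(word, "->", corrupt(word, TOY_CONFUSIONS, rng, p=0.5))
```

With `p=1.0` every confusable letter is swapped, so `"wave"` becomes `"vawe"`; with `p=0.0` the word is returned unchanged. A real system along the lines described above would presumably work on phoneme sequences and use L1-specific confusion probabilities instead of a single global `p`.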
