Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

Abstract

In most Romance languages, the diachronic evolution from Latin involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both the lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize how gender information is distributed between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.