Lost in Translation? Investigating Grammatical Gender Shifts from Latin to Occitan

Abstract

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system: in most Romance languages, a tripartite configuration (masculine, feminine, neuter) was reduced to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both the lexical and the contextual level. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize how gender information is distributed between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.