Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

Abstract

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.