Lost in Translation? Exploring Grammatical Gender Change from Latin to Occitan

Abstract

In most Romance languages, the diachronic evolution from Latin restructured the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both the lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize how gender information is distributed between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.