Lost in Translation? Exploring the Grammatical Gender Shift from Latin to Occitan

Abstract

The diachronic evolution from Latin to the Romance languages restructured the grammatical gender system, reducing a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both the lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize how gender information is distributed between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.