Lost in Translation? An Exploration of the Shift in Grammatical Gender from Latin to Occitan

Abstract

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.