CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

Abstract

French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with such models seeing over 4 million downloads per month. However, these models suffer from temporal concept drift: their training data becomes outdated, and performance declines, especially on new topics and terminology. This calls for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and uses the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa and uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset, with a longer context length and an updated tokenizer that improves tokenization performance for French. We evaluate these models on both general-domain NLP tasks and domain-specific applications, such as medical-field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that the updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are openly available on Hugging Face.
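
To make the difference between the two pre-training objectives named above concrete, the following is a minimal illustrative sketch in plain Python. It is not the training code used for CamemBERTv2 or CamemBERTav2. The function names `mlm_example` and `rtd_example`, the `<mask>` string, and the toy vocabulary are assumptions made for illustration. Two simplifications are also assumed: the MLM sketch always substitutes the mask token (practical recipes often mix in random or unchanged tokens), and the RTD sketch draws replacements at random from a vocabulary, whereas an actual RTD setup samples them from a small generator model.

```python
import random

MASK = "<mask>"  # placeholder mask string, chosen for illustration


def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Build a masked-language-modeling example.

    Returns (inputs, labels). Masked positions hold MASK in `inputs` and the
    original token in `labels`; every other label is None, meaning the
    position does not contribute to the loss.
    """
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def rtd_example(tokens, vocab, replace_prob=0.15, rng=None):
    """Build a replaced-token-detection example.

    Returns (inputs, labels). Every position gets a binary label:
    1 if the token was replaced, 0 if it is the original. A random draw
    from `vocab` stands in for the generator model (an assumption).
    """
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Exclude the original token so that label 1 is always correct.
            candidates = [v for v in vocab if v != tok]
            inputs.append(rng.choice(candidates))
            labels.append(1)
        else:
            inputs.append(tok)
            labels.append(0)
    return inputs, labels


if __name__ == "__main__":
    sentence = "le fromage est affiné en cave".split()
    toy_vocab = sorted(set(sentence) | {"pain", "vin"})
    print(mlm_example(sentence, mask_prob=0.3))
    print(rtd_example(sentence, toy_vocab, replace_prob=0.3))
```

The point of the contrast is that MLM produces a training signal only at the masked positions, while RTD produces a binary label at every position of the input.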