CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

Abstract

French language models such as CamemBERT have been widely adopted across industries for natural language processing (NLP) tasks, with CamemBERT alone seeing over 4 million downloads per month. However, these models suffer from temporal concept drift: as their training data ages, performance declines, especially on new topics and terminology. This calls for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and uses the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa and uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset, with a longer context length and an updated tokenizer that improves tokenization of French. We evaluate these models on both general-domain NLP tasks and domain-specific applications, such as medical tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that the updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.
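To make the difference between the two pretraining objectives concrete, the following is a minimal illustrative sketch, not the training code of either model. The helper names `mlm_example` and `rtd_example`, the toy French sentence, and the hand-picked replacement tokens are all assumptions made for illustration; in actual RTD pretraining the replacements are sampled from a small generator network rather than chosen by hand.

```python
MASK = "[MASK]"

def mlm_example(tokens, mask_positions):
    """MLM: hide the chosen tokens; the model must recover the original
    token at each hidden position."""
    inputs = [MASK if i in mask_positions else t for i, t in enumerate(tokens)]
    # Targets exist only at masked positions (None means no loss there).
    targets = [t if i in mask_positions else None for i, t in enumerate(tokens)]
    return inputs, targets

def rtd_example(tokens, replacements):
    """RTD: swap the chosen tokens for alternatives; the model labels every
    position as original (0) or replaced (1)."""
    inputs = [replacements.get(i, t) for i, t in enumerate(tokens)]
    # A replacement identical to the original token is labeled as original.
    labels = [int(inp != t) for inp, t in zip(inputs, tokens)]
    return inputs, labels

# Hypothetical toy sentence: "the cat sleeps on the sofa".
tokens = ["le", "chat", "dort", "sur", "le", "canapé"]

mlm_inputs, mlm_targets = mlm_example(tokens, {1, 5})
# Position 5 is "replaced" by the same token, so it stays labeled original.
rtd_inputs, rtd_labels = rtd_example(tokens, {1: "chien", 5: "canapé"})

print(mlm_inputs, mlm_targets)
print(rtd_inputs, rtd_labels)
```

In this sketch the MLM targets are defined only at the masked positions, whereas the RTD labels cover every token in the sequence.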