CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
November 13, 2024
Authors: Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
cs.AI
Abstract
French language models, such as CamemBERT, have been widely adopted across
industries for natural language processing (NLP) tasks, with models like
CamemBERT seeing over 4 million downloads per month. However, these models face
challenges due to temporal concept drift, where outdated training data leads to
a decline in performance, especially when encountering new topics and
terminology. This issue emphasizes the need for updated models that reflect
current linguistic trends. In this paper, we introduce two new versions of the
CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these
challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use
of the Replaced Token Detection (RTD) objective for better contextual
understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked
Language Modeling (MLM) objective. Both models are trained on a significantly
larger and more recent dataset with longer context length and an updated
tokenizer that enhances tokenization performance for French. We evaluate the
performance of these models on both general-domain NLP tasks and
domain-specific applications, such as medical field tasks, demonstrating their
versatility and effectiveness across a range of use cases. Our results show
that these updated models vastly outperform their predecessors, making them
valuable tools for modern NLP systems. All our new models, as well as
intermediate checkpoints, are made openly available on Hugging Face.
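
The abstract contrasts two pretraining objectives, MLM for CamemBERTv2 and RTD for CamemBERTav2. The toy sketch below may help make that difference concrete. It is not the authors' training code: the function names, the 15% corruption rate, the `<mask>` string, and the random replacement sampler (standing in for the small generator network normally used with RTD) are all illustrative assumptions.

```python
import random

MASK = "<mask>"  # assumed mask symbol, for illustration only


def pick_positions(n_tokens, rate, rng):
    """Choose the positions to corrupt (always at least one)."""
    k = max(1, round(n_tokens * rate))
    return sorted(rng.sample(range(n_tokens), k))


def mlm_example(tokens, rate=0.15, seed=0):
    """MLM: hide some tokens; the targets are the original tokens
    at the hidden positions only."""
    rng = random.Random(seed)
    inputs = list(tokens)
    targets = {}
    for i in pick_positions(len(tokens), rate, rng):
        targets[i] = tokens[i]
        inputs[i] = MASK
    return inputs, targets


def rtd_example(tokens, vocab, rate=0.15, seed=0):
    """RTD: swap some tokens for plausible-looking alternatives; the
    targets are a 0/1 label (original/replaced) for every position."""
    rng = random.Random(seed)
    inputs = list(tokens)
    for i in pick_positions(len(tokens), rate, rng):
        # A random draw stands in for the generator network (assumption).
        candidates = [w for w in vocab if w != tokens[i]]
        inputs[i] = rng.choice(candidates)
    labels = [int(a != b) for a, b in zip(inputs, tokens)]
    return inputs, labels


if __name__ == "__main__":
    sentence = ["le", "fromage", "est", "bien", "affiné", "."]
    vocab = sentence + ["vin", "pain"]
    print(mlm_example(sentence))
    print(rtd_example(sentence, vocab))
```

In this sketch the MLM targets cover only the masked positions, whereas the RTD labels cover every token in the sequence. That denser training signal is one commonly cited motivation for RTD-style pretraining.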