ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

Abstract

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a French DeBERTaV3 model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being faster training and inference. However, the newly proposed model still provides meaningful architectural improvements over earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pretraining data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings highlight the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.
