NeoBERT: A Next-Generation BERT

Abstract

Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
