NeoBERT: A Next-Generation BERT
February 26, 2025
Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
cs.AI
Abstract
Recent innovations in architecture, pre-training, and fine-tuning have led to
the remarkable in-context learning and reasoning abilities of large
auto-regressive language models such as LLaMA and DeepSeek. In contrast,
encoders like BERT and RoBERTa have not seen the same level of progress despite
being foundational for many downstream NLP applications. To bridge this gap, we
introduce NeoBERT, a next-generation encoder that redefines the capabilities of
bidirectional models by integrating state-of-the-art advancements in
architecture, modern data, and optimized pre-training methodologies. NeoBERT is
designed for seamless adoption: it serves as a plug-and-play replacement for
existing base models, relies on an optimal depth-to-width ratio, and leverages
an extended context length of 4,096 tokens. Despite its compact 250M parameter
footprint, it achieves state-of-the-art results on the massive MTEB benchmark,
outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under
identical fine-tuning conditions. In addition, we rigorously evaluate the
impact of each modification on GLUE and design a uniform fine-tuning and
evaluation framework for MTEB. We release all code, data, checkpoints, and
training scripts to accelerate research and real-world adoption.
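Since the abstract highlights NeoBERT as a plug-and-play replacement for existing base encoders with a 4,096-token context window, a minimal usage sketch may help illustrate what "drop-in" means in practice. This example assumes the released checkpoint is available on the Hugging Face Hub under the identifier `chandar-lab/NeoBERT` and loads via `trust_remote_code`; the mean-pooling step is one common choice for sentence embeddings, not necessarily the paper's exact MTEB setup.

```python
# Minimal sketch: loading NeoBERT as a drop-in bidirectional encoder.
# Hub identifier and trust_remote_code usage are assumptions; consult the
# authors' released code and checkpoints for the exact loading instructions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# The abstract reports a 4,096-token context length, versus 512 for
# classic BERT, so longer inputs can be encoded in a single pass.
text = "NeoBERT is a next-generation bidirectional encoder."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states into a single sentence embedding
# (a common pooling choice for retrieval-style MTEB tasks).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```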