F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
March 19, 2026
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
cs.AI
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated collection of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
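The abstract names matryoshka learning as one ingredient of the training pipeline. As a rough illustration of how such an objective is typically set up, the sketch below computes an in-batch InfoNCE contrastive loss at several nested embedding dimensions; the dimension list, temperature, and function name are illustrative assumptions, not hyperparameters taken from the paper.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce_loss(query_emb, doc_emb,
                            dims=(64, 128, 256, 512, 768),
                            temperature=0.05):
    """InfoNCE contrastive loss averaged over nested (matryoshka) dimensions.

    query_emb, doc_emb: (batch, full_dim) tensors, where doc_emb[i] is the
    positive for query_emb[i] and all other in-batch docs serve as negatives.
    All values here are illustrative, not the paper's actual settings.
    """
    total = 0.0
    for d in dims:
        # Truncate to the first d dimensions and re-normalize, so that
        # each prefix is trained to be a usable unit-norm embedding.
        q = F.normalize(query_emb[:, :d], dim=-1)
        p = F.normalize(doc_emb[:, :d], dim=-1)
        logits = q @ p.T / temperature              # (batch, batch) similarities
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```

The practical payoff of this style of training, which the efficiency claims above allude to, is that a single model can serve embeddings at multiple sizes: a downstream user simply truncates each vector to its first d dimensions and re-normalizes.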