
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

March 19, 2026
作者: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
cs.AI

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
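The abstract names matryoshka learning but does not spell out the objective. As a rough illustration only, not the paper's actual recipe, matryoshka-style training typically averages a contrastive loss over nested prefixes of the embedding vector, so that truncated embeddings remain usable at inference time. The sketch below assumes an InfoNCE-style loss with in-batch negatives; the nesting dimensions, temperature, and loss formulation are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(query_emb, doc_emb, dims=(64, 256, 1024), temperature=0.05):
    """Average an InfoNCE loss over nested embedding prefixes (assumed
    dimensions), so each truncated prefix stays a usable embedding."""
    total = 0.0
    for d in dims:
        # Truncate to the first d dimensions and re-normalize.
        q = F.normalize(query_emb[:, :d], dim=-1)
        k = F.normalize(doc_emb[:, :d], dim=-1)
        # In-batch negatives: the i-th document is the positive for the i-th query.
        logits = q @ k.T / temperature
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)

# Usage with random stand-in embeddings (batch of 8, full dimension 1024):
q = torch.randn(8, 1024)
d = torch.randn(8, 1024)
loss = matryoshka_infonce(q, d)
```

The practical payoff of this design, and a likely reason it pairs well with pruning and distillation in an efficiency-focused family like F2LLM-v2, is that a single trained model can serve multiple dimensionality budgets by simply slicing its output vector.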