F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
