F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Abstract

We introduce F2LLM (Foundation to Feature Large Language Models), a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models, which require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is finetuned directly from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
