F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
October 2, 2025
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
cs.AI
Abstract
We introduce F2LLM - Foundation to Feature Large Language Models, a suite of
state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike
previous top-ranking embedding models that require massive contrastive
pretraining, sophisticated training pipelines, and costly synthetic training
data, F2LLM is directly finetuned from foundation models on 6 million
query-document-negative tuples curated from open-source, non-synthetic
datasets, striking a strong balance between training cost, model size, and
embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd
among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B
ranks 1st among models in the 1B-2B size range. To facilitate future research
in the field, we release the models, training dataset, and code, positioning
F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
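
The abstract does not spell out the training objective, but finetuning on query-document-negative tuples is typically done with a contrastive (InfoNCE-style) loss. The sketch below is an illustrative assumption of such an objective, not the authors' actual F2LLM training code; the function name, batching scheme, and temperature value are all hypothetical.

```python
# Minimal sketch of a contrastive (InfoNCE-style) loss over
# (query, positive document, hard negative) triples, as commonly used
# for embedding-model finetuning. Illustrative only; not the authors'
# released implementation.
import torch
import torch.nn.functional as F


def info_nce_loss(q_emb: torch.Tensor,    # (B, d) query embeddings
                  pos_emb: torch.Tensor,  # (B, d) positive document embeddings
                  neg_emb: torch.Tensor,  # (B, d) hard-negative embeddings
                  temperature: float = 0.05) -> torch.Tensor:
    # L2-normalize so dot products become cosine similarities.
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    # Candidate pool: each query scores against every positive in the batch
    # (in-batch negatives) plus every hard negative.
    candidates = torch.cat([pos, neg], dim=0)   # (2B, d)
    logits = q @ candidates.t() / temperature   # (B, 2B)

    # The correct document for query i is positive i (index i in the pool).
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

In this kind of setup, the embeddings would come from a decoder-only foundation model (e.g., pooled hidden states), and the curated hard negatives sharpen the contrast beyond what in-batch negatives alone provide.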