

F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

October 2, 2025
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
cs.AI

Abstract

We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
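The abstract states that F2LLM is fine-tuned directly on query-document-negative tuples. As an illustration only, the minimal sketch below shows a standard InfoNCE-style contrastive objective over such triplets; the exact loss form, temperature value, hard-negative count, and function names here are assumptions for illustration and are not specified in the abstract.

```python
# Minimal sketch (not the authors' code): a contrastive objective over
# query-document-negative triplets, the data format described in the abstract.
# The InfoNCE formulation and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(q, d_pos, d_neg, temperature=0.05):
    """q, d_pos: (B, H) embeddings; d_neg: (B, K, H) hard-negative embeddings."""
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    d_neg = F.normalize(d_neg, dim=-1)

    pos_sim = (q * d_pos).sum(-1, keepdim=True)        # (B, 1) query-positive similarity
    neg_sim = torch.einsum("bh,bkh->bk", q, d_neg)     # (B, K) query-negative similarities

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)

# Example with random tensors standing in for encoder outputs.
B, K, H = 4, 7, 1024
loss = contrastive_loss(torch.randn(B, H), torch.randn(B, H), torch.randn(B, K, H))
print(loss.item())
```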