F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Abstract

We introduce F2LLM (Foundation to Feature Large Language Models), a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models, which require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is finetuned directly from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets. This strikes a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
