F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.

