Generative Representational Instruction Tuning

Abstract

All text-based language problems can be reduced to either generation or embedding. Current models perform well at only one or the other. We introduce generative representational instruction tuning (GRIT), whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) and outperforms all models up to its size on a range of generative tasks. Scaling up further, GritLM 8x7B outperforms all open generative language models that we tried while remaining among the best embedding models. Notably, we find that GRIT matches training on only generative or only embedding data, so the two can be unified with no loss in performance. Among other benefits, unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by more than 60% for long documents, because separate retrieval and generation models are no longer required. Models, code, and related materials are freely available at https://github.com/ContextualAI/gritlm.
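
As a rough illustration of how a single model might serve both task types by distinguishing them through instructions, the toy Python sketch below sends embedding and generation requests through one shared function and branches only on the prompt format. Everything in it is a hypothetical stand-in chosen for illustration: the tag strings, the function names, the hashed "hidden states", the mean pooling, and the tiny vocabulary are assumptions, not the GritLM implementation or its API.

```python
import hashlib
import math
from typing import List

DIM = 8
# Hypothetical markers: the abstract does not specify the actual prompt format.
EMBED_TAG = "<embed>"
GENERATE_TAG = "<generate>"
# Toy vocabulary used only to make the generation path return something.
VOCAB = ["retrieval", "generation", "embedding", "model"]


def token_vector(token: str) -> List[float]:
    """Deterministic toy stand-in for a per-token hidden state."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def hidden_states(prompt: str) -> List[List[float]]:
    """Shared 'backbone': both task types pass through this same function."""
    return [token_vector(tok) for tok in prompt.split()]


def embed(instruction: str, text: str) -> List[float]:
    """Embedding path: mean-pool the shared states into one unit-length vector."""
    states = hidden_states(f"{instruction} {EMBED_TAG} {text}")
    pooled = [sum(col) / len(states) for col in zip(*states)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def generate_next(instruction: str) -> str:
    """Generation path: pick the toy vocabulary word that best matches the last state."""
    last = hidden_states(f"{instruction} {GENERATE_TAG}")[-1]

    def score(word: str) -> float:
        return sum(a * b for a, b in zip(last, token_vector(word)))

    return max(VOCAB, key=score)


if __name__ == "__main__":
    vec = embed("Represent the query for retrieval", "what is GRIT")
    print(len(vec), generate_next("Summarize the document"))
```

The point of the sketch is only that both paths call the same `hidden_states` function, so no second model is needed for retrieval. In a real system that shared function would be the language model itself.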