Kanana: Compute-efficient Bilingual Language Models

Abstract

We introduce Kanana, a series of bilingual language models that demonstrate excellent performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. This report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. It also outlines the methodologies used during post-training of the Kanana models, namely supervised fine-tuning and preference optimization, which aim to improve their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches for adapting the language models to specific scenarios, such as embedding, retrieval-augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
