Kanana: Compute-efficient Bilingual Language Models
February 26, 2025
Authors: Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
cs.AI
Abstract
We introduce Kanana, a series of bilingual language models that demonstrate performance in Korean exceeding that of comparable models and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. This report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies used during post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on viable approaches for adapting the language models to specific scenarios, such as embedding, retrieval-augmented generation, and function calling. The Kanana model series spans 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
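Among the pre-training techniques named above, depth up-scaling generally refers to initializing a deeper model by reusing and duplicating layers of a smaller pre-trained model before continuing pre-training. The following is a minimal sketch of that general idea, not the Kanana team's actual recipe; the function name, layer indices, and PyTorch representation of the layer stack are illustrative assumptions.

```python
# Minimal sketch of depth up-scaling (illustrative assumption, not Kanana's exact method):
# grow a decoder by duplicating a contiguous block of its transformer layers,
# so the deeper model starts from the smaller model's pre-trained weights.
import copy
import torch.nn as nn

def depth_up_scale(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    """Return a new layer stack where layers[start:end] are copied and inserted
    after the original block, yielding a deeper network."""
    duplicated = [copy.deepcopy(layer) for layer in layers[start:end]]
    new_layers = list(layers[:end]) + duplicated + list(layers[end:])
    return nn.ModuleList(new_layers)

# Hypothetical usage: repeat the middle 16 layers of a 32-layer decoder to get
# a 48-layer model, which would then be continually pre-trained.
# base_model.layers = depth_up_scale(base_model.layers, start=8, end=24)
```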