KORMo: Korean Open Reasoning Model for Everyone
October 10, 2025
Authors: Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
cs.AI
Abstract
This work presents the first large-scale investigation into constructing a
fully open bilingual large language model (LLM) for a non-English language,
specifically Korean, trained predominantly on synthetic data. We introduce
KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English
corpus in which 68.74% of the Korean portion is synthetic. Through systematic
experimentation, we demonstrate that synthetic data, when carefully curated
with balanced linguistic coverage and diverse instruction styles, does not
cause instability or degradation during large-scale pretraining. Furthermore,
the model achieves performance comparable to that of contemporary open-weight
multilingual baselines across a wide range of reasoning, knowledge, and
instruction-following benchmarks. Our experiments reveal two key findings: (1)
synthetic data can reliably sustain long-horizon pretraining without model
collapse, and (2) bilingual instruction tuning enables near-native reasoning
and discourse coherence in Korean. By fully releasing all components including
data, code, training recipes, and logs, this work establishes a transparent
framework for developing synthetic data-driven fully open models (FOMs) in
low-resource settings and sets a reproducible precedent for future multilingual
LLM research.
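
The abstract's headline figure is that 68.74% of the Korean portion of the pretraining corpus is synthetic. Below is a minimal sketch, not the authors' pipeline, of how such a mixture ratio can be computed and checked for a bilingual corpus; the slice names and token counts are hypothetical placeholders chosen only so that the Korean synthetic share comes out to the reported value.

```python
from dataclasses import dataclass

@dataclass
class CorpusSlice:
    name: str        # identifier for the data source (hypothetical)
    language: str    # "ko" or "en"
    synthetic: bool  # True if the slice was machine-generated
    tokens: int      # token count (placeholder values, not real corpus sizes)

# Hypothetical bilingual mixture; the Korean counts are chosen so the
# synthetic share matches the 68.74% reported for KORMo-10B.
mixture = [
    CorpusSlice("ko_synthetic", "ko", True,  6_874),
    CorpusSlice("ko_natural",   "ko", False, 3_126),
    CorpusSlice("en_natural",   "en", False, 10_000),
]

def synthetic_share(slices: list[CorpusSlice], language: str) -> float:
    """Fraction of tokens in `language` that come from synthetic slices."""
    lang_tokens = sum(s.tokens for s in slices if s.language == language)
    syn_tokens = sum(s.tokens for s in slices if s.language == language and s.synthetic)
    return syn_tokens / lang_tokens if lang_tokens else 0.0

if __name__ == "__main__":
    print(f"Synthetic share of Korean portion: {synthetic_share(mixture, 'ko'):.2%}")
    # -> Synthetic share of Korean portion: 68.74%
```

This kind of check is only an illustration of the mixture bookkeeping the abstract describes; the actual corpus composition, curation criteria, and token budgets are documented in the released data and training recipes.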