KORMo: Korean Open Reasoning Model for Everyone
October 10, 2025
Authors: Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
cs.AI
Abstract
This work presents the first large-scale investigation into constructing a
fully open bilingual large language model (LLM) for a non-English language,
specifically Korean, trained predominantly on synthetic data. We introduce
KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English
corpus in which 68.74% of the Korean portion is synthetic. Through systematic
experimentation, we demonstrate that synthetic data, when carefully curated
with balanced linguistic coverage and diverse instruction styles, does not
cause instability or degradation during large-scale pretraining. Furthermore,
the model achieves performance comparable to that of contemporary open-weight
multilingual baselines across a wide range of reasoning, knowledge, and
instruction-following benchmarks. Our experiments reveal two key findings: (1)
synthetic data can reliably sustain long-horizon pretraining without model
collapse, and (2) bilingual instruction tuning enables near-native reasoning
and discourse coherence in Korean. By fully releasing all components including
data, code, training recipes, and logs, this work establishes a transparent
framework for developing synthetic data-driven fully open models (FOMs) in
low-resource settings and sets a reproducible precedent for future multilingual
LLM research.