Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
October 5, 2025
Authors: Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu
cs.AI
Abstract
Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English, and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated by Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 ± 25), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the nine evaluated benchmarks. Ablations show that **Language-Mixed CoT** is more effective than monolingual CoT, also yielding cross-lingual and multi-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.
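
For readers who want to try the released checkpoints, below is a minimal usage sketch with Hugging Face Transformers. The exact repository id (`KOREAson/KO-REAson-35B`) and the sample prompt are assumptions for illustration, not taken from the paper; consult the collection at https://huggingface.co/KOREAson for the actual model names.

```python
# Minimal sketch (assumptions noted): load one of the released KO-REAson models
# and generate a response. The repo id below is a guess based on the model name
# mentioned in the abstract; verify it on the Hugging Face hub before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KOREAson/KO-REAson-35B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A Korean math prompt; per the paper's schema, the model is expected to emit a
# long reasoning trace that mixes English and Korean before the final answer.
prompt = "두 자리 소수 중에서 자릿수의 합이 10인 수를 모두 구하세요."
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (skip the prompt portion).
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```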