

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

April 13, 2026
Authors: Lester James V. Miranda, Ivan Vulić, Anna Korhonen
cs.AI

Abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call the Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, response length, and response fluency capture over 93.3% of the variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.
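The abstract names a Polyglot Score that combines intrinsic data quality with extrinsic student performance, and a feature analysis in which prompt diversity, response length, and response fluency explain most of the variance in intrinsic quality, but it gives no formulas. The sketch below is a hypothetical illustration of that setup, not the paper's method: the `polyglot_score` function, the `alpha` weight, and the toy per-teacher data are all assumptions made for the example.

```python
# Hypothetical sketch of the abstract's two analyses: (1) combining an
# intrinsic data-quality score with extrinsic student accuracy into a
# single composite, and (2) regressing intrinsic quality on simple
# data-quality features to estimate variance explained (cf. the 93.3%).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy per-teacher features (prompt diversity, response length, fluency),
# assumed normalized to [0, 1]; 10 teachers, matching the paper's count.
features = rng.uniform(size=(10, 3))

# Toy intrinsic quality and extrinsic student accuracy per teacher.
intrinsic_quality = features @ np.array([0.5, 0.2, 0.3]) + rng.normal(0, 0.02, 10)
student_accuracy = 0.6 * intrinsic_quality + rng.normal(0, 0.02, 10)

def polyglot_score(intrinsic, extrinsic, alpha=0.5):
    """Assumed composite: weighted mix of intrinsic quality and accuracy."""
    return alpha * intrinsic + (1 - alpha) * extrinsic

scores = polyglot_score(intrinsic_quality, student_accuracy)

# Linear regression of intrinsic quality on the three features; R^2 plays
# the role of "variance explained" in the abstract's feature analysis.
reg = LinearRegression().fit(features, intrinsic_quality)
print(f"R^2 (features -> intrinsic quality): {reg.score(features, intrinsic_quality):.3f}")
print(f"Composite scores per teacher: {np.round(scores, 3)}")
```

On this reading, a teacher ranks highly only if it both produces high-quality data and transfers that quality to students; the actual weighting and feature definitions would come from the paper itself.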