

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

April 13, 2026
Authors: Lester James V. Miranda, Ivan Vulić, Anna Korhonen
cs.AI

Abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call the Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of the variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.
April 15, 2026