

Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

January 26, 2026
Authors: Kunat Pipatanakul, Pittawat Taveekitworachai
cs.AI

Abstract

Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon-S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO -- an extension of GRPO that augments the GRPO loss with a next-word prediction loss -- improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.
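As a rough illustration of the loss described in the abstract, a minimal sketch of an InK-GRPO-style objective would add an auxiliary next-word prediction (cross-entropy) term to the GRPO objective. The weighting coefficient \(\lambda\) and the exact form of each component below are assumptions for illustration, not the paper's formulation:

% Sketch only: \lambda and the precise GRPO / next-word prediction terms are illustrative assumptions.
\[
\mathcal{L}_{\text{InK-GRPO}}(\theta)
  = \mathcal{L}_{\text{GRPO}}(\theta)
  + \lambda\,\mathcal{L}_{\text{NWP}}(\theta),
\qquad
\mathcal{L}_{\text{NWP}}(\theta)
  = -\frac{1}{T}\sum_{t=1}^{T} \log \pi_{\theta}\!\left(x_t \mid x_{<t}\right),
\]

where \(x_1,\dots,x_T\) is a reference token sequence (e.g., Thai legal or knowledge text) and \(\pi_{\theta}\) is the policy being fine-tuned, so the auxiliary term injects knowledge via standard language-modeling supervision alongside the reinforcement objective.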