
Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

January 26, 2026
作者: Kunat Pipatanakul, Pittawat Taveekitworachai
cs.AI

Abstract

Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO -- an extension of GRPO that augments the GRPO loss with a next-word prediction loss -- improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.
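The abstract describes InK-GRPO only at a high level: the GRPO objective is augmented with a next-word prediction loss. The sketch below illustrates one plausible reading of that description, assuming the two terms are combined additively with a tunable weight. The function names, tensor shapes, omission of GRPO's clipping and KL terms, and the `nwp_weight` coefficient are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the paper's implementation) of combining a GRPO-style
# objective with a next-word prediction (NWP) loss, as the abstract describes
# for InK-GRPO. Shapes, names, and the mixing weight are assumptions.

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group of sampled completions.

    rewards: (num_prompts, group_size) scalar rewards per completion.
    Returns group-normalized advantages of the same shape.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def ink_grpo_loss(
    completion_logprobs: torch.Tensor,  # (num_prompts, group_size) summed log-probs of sampled completions
    rewards: torch.Tensor,              # (num_prompts, group_size) scalar rewards per completion
    nwp_logits: torch.Tensor,           # (batch, seq_len, vocab) logits on reference/knowledge text
    nwp_labels: torch.Tensor,           # (batch, seq_len) next-token targets, -100 = ignore
    nwp_weight: float = 0.1,            # assumed mixing coefficient
) -> torch.Tensor:
    # GRPO-style policy-gradient surrogate with group-normalized advantages
    # (the clipping and KL-regularization terms of the full GRPO objective
    # are omitted here for brevity).
    advantages = group_relative_advantages(rewards).detach()
    policy_loss = -(advantages * completion_logprobs).mean()

    # Next-word prediction loss: standard causal-LM cross-entropy on reference text.
    nwp_loss = F.cross_entropy(
        nwp_logits.reshape(-1, nwp_logits.size(-1)),
        nwp_labels.reshape(-1),
        ignore_index=-100,
    )

    # InK-GRPO, per the abstract: the GRPO loss augmented with the NWP loss.
    return policy_loss + nwp_weight * nwp_loss
```

Under this reading, the NWP term keeps the policy anchored to in-domain reference text (e.g., Thai legal or cultural corpora) during reinforcement fine-tuning, which would be consistent with the reported preservation of general capabilities; how the paper actually weights or schedules the two terms is not stated in the abstract.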