

Solar Open Technical Report

January 11, 2026
Authors: Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh
cs.AI

Abstract

We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum that jointly optimizes composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply SnapPO, our proposed framework for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for AI development in underserved languages.
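The progressive curriculum is the most mechanistic claim in the abstract: data composition, quality thresholds, and domain coverage all shift over the 20T-token run. The sketch below is a minimal illustration of what such a phased curriculum can look like, not Solar Open's actual recipe; the phase boundaries, domain names, weights, and quality scores are all hypothetical placeholders.

```python
import bisect
from dataclasses import dataclass

# Hypothetical sketch of a progressive data curriculum: phase
# boundaries (in trillions of tokens), per-domain sampling weights,
# and a rising quality threshold. None of these numbers come from
# the Solar Open report; they only illustrate the mechanism.

@dataclass
class Phase:
    end_tokens_t: float       # phase ends after this many trillion tokens
    quality_threshold: float  # minimum document quality score to keep
    domain_weights: dict      # relative sampling weight per domain

CURRICULUM = [
    Phase(8.0,  0.3, {"web": 0.70, "code": 0.10, "math": 0.05, "korean": 0.15}),
    Phase(16.0, 0.5, {"web": 0.45, "code": 0.20, "math": 0.15, "korean": 0.20}),
    Phase(20.0, 0.7, {"web": 0.25, "code": 0.25, "math": 0.25, "korean": 0.25}),
]

def phase_at(tokens_seen_t: float) -> Phase:
    """Return the curriculum phase active after `tokens_seen_t` trillion tokens."""
    ends = [p.end_tokens_t for p in CURRICULUM]
    idx = min(bisect.bisect_left(ends, tokens_seen_t), len(CURRICULUM) - 1)
    return CURRICULUM[idx]

def accept(doc_domain: str, doc_quality: float, tokens_seen_t: float) -> bool:
    """Keep a document only if it clears the current phase's quality bar."""
    phase = phase_at(tokens_seen_t)
    return doc_domain in phase.domain_weights and doc_quality >= phase.quality_threshold

if __name__ == "__main__":
    for t in (2.0, 10.0, 19.0):
        p = phase_at(t)
        print(f"{t:>5.1f}T tokens -> threshold={p.quality_threshold}, weights={p.domain_weights}")
```

The design choice illustrated here is that filtering and mixing are coupled to training progress: early phases admit broad, lower-quality web text, while later phases tighten the quality bar and rebalance toward reasoning-heavy and underserved-language domains.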