

Solar Open Technical Report

January 11, 2026
Authors: Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh
cs.AI

Abstract

We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.
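As a rough illustration of the second challenge, the sketch below shows one way a progressive curriculum could stage domain mixture weights and quality thresholds across token budgets. It is a minimal sketch under assumed details: the domain list, stage budgets, thresholds, and mixture weights are hypothetical placeholders, not Solar Open's actual configuration; only the 20-trillion-token total is taken from the abstract.

```python
import random

# Hypothetical domain partition (not Solar Open's actual taxonomy).
DOMAINS = ["web", "code", "math", "korean", "synthetic"]

# Each stage is (token_budget, quality_threshold, mixture weights over DOMAINS).
# Later stages raise the quality floor and shift weight toward domain-specific
# and synthetic data; budgets sum to the abstract's 20T tokens. All values here
# are illustrative assumptions.
STAGES = [
    (10e12, 0.50, [0.60, 0.15, 0.10, 0.10, 0.05]),
    (6e12,  0.70, [0.40, 0.20, 0.15, 0.15, 0.10]),
    (4e12,  0.85, [0.20, 0.20, 0.20, 0.20, 0.20]),
]

def sample_domain(weights):
    """Draw a domain according to the current stage's mixture weights."""
    return random.choices(DOMAINS, weights=weights, k=1)[0]

def curriculum(streams):
    """Yield (domain, tokens) pairs following the staged curriculum.

    `streams` is a hypothetical dict mapping each domain to an iterator of
    (tokens, quality) pairs, where `quality` is assumed to be a [0, 1] score
    from an upstream quality classifier.
    """
    for budget, q_min, weights in STAGES:
        consumed = 0
        while consumed < budget:
            domain = sample_domain(weights)
            tokens, quality = next(streams[domain])
            if quality < q_min:  # enforce this stage's quality floor
                continue
            consumed += len(tokens)
            yield domain, tokens
```

The point of staging, rather than fixing one mixture, is that composition, quality thresholds, and domain coverage can be tuned jointly as training progresses; the report's actual optimization of these knobs is presumably far more involved than this toy scheduler.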