
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

August 27, 2024
Authors: Guosheng Dong, Da Pan, Yiding Sun, Shusen Zhang, Zheng Liang, Xin Wu, Yanjun Shen, Fan Yang, Haoze Sun, Tianpeng Li, Mingan Lin, Jianhua Xu, Yufan Zhang, Xiaonan Nie, Lei Su, Bingning Wang, Wentao Zhang, Jiaxin Mao, Zenan Zhou, Weipeng Chen
cs.AI

Abstract

The general capabilities of Large Language Models (LLMs) rely heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. We then pretrain a 7B model, BaichuanSEED, on 3T tokens processed by our pipeline, without any deliberate optimization for downstream tasks, followed by a simple but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves performance on comprehensive benchmarks comparable to several advanced commercial large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimization of downstream tasks, such as mathematics and coding.
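
The abstract does not specify how the collection and deduplication stages are implemented. As a rough illustration of the kind of near-duplicate filtering such a pipeline typically requires, the sketch below computes MinHash signatures over word shingles and greedily drops documents that closely overlap ones already kept; the function names, shingle size, and similarity threshold are illustrative assumptions, not details of BaichuanSEED's actual pipeline.

```python
import hashlib


def shingles(text, k=5):
    """Split a document into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """Summarize a shingle set by its minimum hash value under num_hashes seeded hashes."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingle_set
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if its estimated similarity to every kept document is below threshold."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "large language models rely on broad high quality pretraining data",
    ]
    # The first two documents are near-duplicates, so this likely prints 2.
    print(len(deduplicate(corpus)))
```

At the scale of a multi-trillion-token corpus, the greedy pairwise scan above would normally be replaced by locality-sensitive hashing over the MinHash signatures, so each new document is compared only against a small candidate set rather than everything kept so far.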

