Yi: Open Foundation Models by 01.AI

March 7, 2024
Authors: 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
cs.AI

Abstract

We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks such as MMLU, and our finetuned chat models deliver strong human preference rates on major evaluation platforms such as AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of the Yi models primarily to their data quality, which results from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data-deduplication and quality-filtering pipeline. For finetuning, we polish a small-scale (fewer than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations with the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that, given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.
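
The abstract names a cascaded data-deduplication and quality-filtering pipeline but does not specify it. The following is a minimal Python sketch of what such a cascade could look like; all stage names, thresholds, and heuristics are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a cascaded dedup + quality-filter pass over a corpus.
# All thresholds and heuristics below are illustrative assumptions.
import hashlib
import re

def exact_dedup(documents):
    """Stage 1: drop byte-identical documents via content hashing."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def near_dedup(documents, shingle_size=5, jaccard_threshold=0.8):
    """Stage 2: crude near-duplicate removal via shingle-set Jaccard.
    (A production pipeline would use MinHash/LSH rather than pairwise sets.)"""
    def shingles(text):
        tokens = text.split()
        return {" ".join(tokens[i:i + shingle_size])
                for i in range(max(1, len(tokens) - shingle_size + 1))}
    kept, kept_shingles = [], []
    for doc in documents:
        s = shingles(doc)
        if all(len(s & t) / max(1, len(s | t)) < jaccard_threshold
               for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

def quality_filter(documents, min_words=20, max_symbol_ratio=0.1):
    """Stage 3: heuristic quality rules (document length, symbol density)."""
    kept = []
    for doc in documents:
        symbols = len(re.findall(r"[^\w\s]", doc))
        if len(doc.split()) >= min_words and symbols / max(1, len(doc)) <= max_symbol_ratio:
            kept.append(doc)
    return kept

def cascade(documents):
    # Cheapest stage first, so later stages see a smaller candidate set.
    return quality_filter(near_dedup(exact_dedup(documents)))
```

At trillion-token scale the pairwise comparison would be replaced by MinHash/LSH sketches and streaming I/O, but the cascade ordering (cheap exact dedup first, expensive checks last) is the design point the abstract alludes to.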
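
The needle-in-a-haystack retrieval claim can be probed with a simple harness. Below is a minimal sketch assuming a generic `generate(prompt) -> str` callable standing in for the chat model; the needle text, filler sentence, and depth grid are chosen purely for illustration.

```python
# Minimal needle-in-a-haystack probe for a long-context model.
# `generate` is any prompt -> completion callable; all test strings
# and parameters here are illustrative assumptions.
def build_haystack(needle: str, filler: str, total_words: int, depth: float) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end)
    inside `total_words` words of repeated filler text."""
    base = filler.split()
    words = (base * (total_words // len(base) + 1))[:total_words]
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [needle] + words[pos:])

def run_probe(generate,
              needle="The secret passphrase is 7-golden-otters.",
              question="What is the secret passphrase?",
              filler="The quick brown fox jumps over the lazy dog.",
              total_words=150_000,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Score retrieval at several insertion depths; returns depth -> pass/fail."""
    results = {}
    for depth in depths:
        context = build_haystack(needle, filler, total_words, depth)
        answer = generate(f"{context}\n\n{question}")
        results[depth] = "7-golden-otters" in answer
    return results
```

Sweeping both insertion depth and total context length (up to the 200K window) yields the retrieval heatmap commonly reported for this test.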