

Yi: Open Foundation Models by 01.AI

March 7, 2024
Authors: 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
cs.AI

Abstract

We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rates on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable supercomputing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to their data quality, resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small-scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.
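To make the data-engineering claim concrete, the sketch below shows what one stage of a cascaded deduplication and quality-filtering pass could look like in Python. This is a minimal sketch, not the paper's actual pipeline: the MinHash parameters, the 0.8 similarity threshold, and the toy `quality_ok` heuristics are all illustrative assumptions.

```python
# Minimal sketch of one cascaded dedup + quality-filter stage (illustrative;
# thresholds and heuristics are assumptions, not the paper's pipeline).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of MinHash permutations (assumed)

def minhash(text: str) -> MinHash:
    """Shingle the document into word 3-grams and hash them."""
    m = MinHash(num_perm=NUM_PERM)
    words = text.split()
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

def quality_ok(text: str) -> bool:
    """Toy heuristics standing in for learned quality scorers."""
    if len(text) < 200:                              # too short
        return False
    words = text.split()
    if len(set(words)) / max(len(words), 1) < 0.3:   # highly repetitive
        return False
    return True

def dedup_and_filter(docs):
    """Stage 1: near-duplicate removal via MinHash LSH; stage 2: quality filters."""
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
    kept = []
    for i, doc in enumerate(docs):
        sig = minhash(doc)
        if lsh.query(sig):        # near-duplicate of an already-kept document
            continue
        lsh.insert(str(i), sig)
        if quality_ok(doc):       # cascade: only deduped docs reach the filters
            kept.append(doc)
    return kept
```

In a real cascade, the filtering stage would chain several scorers (perplexity, classifier, and rule-based filters) rather than the single heuristic shown here.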
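For the vision-language extension, the abstract describes training the model to align visual representations to the language model's semantic space. Below is a minimal sketch of one common way to do this, a small projection module between a ViT encoder and the LM; the module shape, dimensions, and the `VisualProjector` name are assumptions for illustration, not Yi-VL's documented architecture.

```python
# Sketch: project ViT patch features into the LM's embedding space with a
# small MLP, then train on paired image-text data. Sizes are assumed.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder features to the LM hidden size (details assumed)."""
    def __init__(self, vit_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim) from a ViT encoder
        return self.proj(patch_features)  # (batch, num_patches, lm_dim)

# The projected patch embeddings are prepended to the text token embeddings,
# and the combined sequence is fed to the chat LM; training on image-text
# pairs aligns the two modalities in the LM's semantic space.
```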
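The depth-upscaled variant is described as extending the depth of a pretrained checkpoint and then continuing pretraining. One way such an expansion can be performed is by duplicating a contiguous span of decoder layers, as in the minimal sketch below; the checkpoint name, layer indices, and the model internals accessed are assumptions for illustration, not Yi's published recipe.

```python
# Illustrative sketch of depth upscaling: duplicate a middle span of decoder
# layers from a pretrained checkpoint, then continually pretrain the deeper
# stack. Checkpoint name and layer span are assumptions, not Yi's recipe.
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B")  # base checkpoint
layers = model.model.layers                  # nn.ModuleList of decoder blocks

# Duplicate a contiguous middle span (indices chosen for illustration only).
start, end = 8, 24
duplicated = [copy.deepcopy(layers[i]) for i in range(start, end)]
new_stack = list(layers[:end]) + duplicated + list(layers[end:])

for i, layer in enumerate(new_stack):
    # Keep KV-cache layer indices consistent (attribute name assumed from
    # the transformers Llama implementation).
    layer.self_attn.layer_idx = i

model.model.layers = nn.ModuleList(new_stack)
model.config.num_hidden_layers = len(new_stack)
# The upscaled model would then be continually pretrained on the corpus.
```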