TeleChat Technical Report
January 8, 2024
Authors: Zihan Wang, Xinzhang Liu, Shixuan Liu, Yitong Yao, Yuyao Huang, Zhongjiang He, Xuelong Li, Yongxiang Li, Zhonghao Che, Zhaoxi Zhang, Yan Wang, Xin Wang, Luwen Pu, Huihan Xu, Ruiyu Fang, Yu Zhao, Jie Zhang, Xiaomeng Huang, Zhilong Lu, Jiaxin Peng, Wenjun Zheng, Shiquan Wang, Bingkai Yang, Xuewei He, Zhuoru Jiang, Qiyi Xie, Yanhan Zhang, Zhongqiu Li, Lingling Shi, Weiwei Fu, Yin Zhang, Zilu Huang, Sishi Xiong, Yuxiang Zhang, Chao Wang, Shuangyong Song
cs.AI
Abstract
In this technical report, we present TeleChat, a collection of large language
models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. The
collection includes pretrained language models as well as fine-tuned chat
models that are aligned with human preferences. TeleChat is initially
pretrained on an extensive corpus comprising trillions of tokens of diverse
English and Chinese text. Subsequently, the model
undergoes fine-tuning to align with human preferences, following a detailed
methodology that we describe. We evaluate the performance of TeleChat on
various tasks, including language understanding, mathematics, reasoning, code
generation, and knowledge-based question answering. Our findings indicate that
TeleChat achieves comparable performance to other open-source models of similar
size across a wide range of public benchmarks. To support future research and
applications utilizing LLMs, we release the fine-tuned model checkpoints of
TeleChat's 7B and 12B variants, along with code and a portion of our
pretraining data, to the public community.
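
To illustrate how the released chat checkpoints might be used, below is a minimal sketch of loading a TeleChat variant with the Hugging Face transformers library. The repository id "Tele-AI/telechat-7B" and the use of trust_remote_code are assumptions based on common release conventions, not details stated in this abstract; consult the official release for the exact names and loading instructions.

```python
# Minimal sketch of loading a released TeleChat checkpoint for text generation.
# NOTE: the repo id below is an assumption; replace it with the id from the
# official TeleChat release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tele-AI/telechat-7B"  # assumed repo id; a 12B variant is also released

# Custom architectures typically require trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",  # requires the `accelerate` package
)

# Run a simple generation with the fine-tuned chat model.
prompt = "What is a large language model?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```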