TeleChat Technical Report
January 8, 2024
Authors: Zihan Wang, Xinzhang Liu, Shixuan Liu, Yitong Yao, Yuyao Huang, Zhongjiang He, Xuelong Li, Yongxiang Li, Zhonghao Che, Zhaoxi Zhang, Yan Wang, Xin Wang, Luwen Pu, Huihan Xu, Ruiyu Fang, Yu Zhao, Jie Zhang, Xiaomeng Huang, Zhilong Lu, Jiaxin Peng, Wenjun Zheng, Shiquan Wang, Bingkai Yang, Xuewei He, Zhuoru Jiang, Qiyi Xie, Yanhan Zhang, Zhongqiu Li, Lingling Shi, Weiwei Fu, Yin Zhang, Zilu Huang, Sishi Xiong, Yuxiang Zhang, Chao Wang, Shuangyong Song
cs.AI
Abstract
In this technical report, we present TeleChat, a collection of large language
models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. The
collection includes pretrained language models as well as fine-tuned chat
models aligned with human preferences. TeleChat is first pretrained on an
extensive corpus of diverse English and Chinese texts comprising trillions of
tokens. The model is then fine-tuned to align with human preferences,
following the detailed methodology we describe. We evaluate TeleChat on a
variety of tasks, including language understanding, mathematics, reasoning,
code generation, and knowledge-based question answering. Our findings indicate
that TeleChat achieves performance comparable to other open-source models of
similar size across a wide range of public benchmarks. To support future
research and applications of LLMs, we release the fine-tuned model checkpoints
of TeleChat's 7B and 12B variants, along with code and a portion of our
pretraining data, to the public community.
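
Since the report announces publicly released checkpoints, the following is a minimal sketch of how one might load a released TeleChat chat model with the Hugging Face transformers library. The repository id "Tele-AI/telechat-7B", the prompt, and the generation settings are assumptions for illustration; consult the official release for the exact identifiers and recommended inference settings.

```python
# Minimal sketch: loading a released TeleChat checkpoint via Hugging Face
# transformers. The repo id below is an assumption for illustration; use the
# id published in the official TeleChat release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tele-AI/telechat-7B"  # assumed repository id

# Custom model architectures typically require trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    device_map="auto",          # place weights automatically (needs accelerate)
    trust_remote_code=True,
)

# Example prompt (Chinese): "Please introduce large language models."
prompt = "请介绍一下大型语言模型。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This uses only the generic AutoModel loading path, so it makes no claim about TeleChat-specific chat templates or generation utilities; the released code may provide its own dedicated chat interface.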