TeleChat Technical Report

Abstract

In this technical report, we present TeleChat, a collection of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. It includes pretrained language models as well as fine-tuned chat models that are aligned with human preferences. TeleChat is first pretrained on an extensive corpus of trillions of tokens containing a diverse collection of English and Chinese texts. The models are then fine-tuned to align with human preferences, following a methodology that we describe in detail in this report. We evaluate the performance of TeleChat on various tasks, including language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves performance comparable to that of other open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with code and a portion of our pretraining data, to the public community.
