Meltemi: The first open Large Language Model for Greek

July 30, 2024
作者: Leon Voukoutis, Dimitris Roussis, Georgios Paraskevopoulos, Sokratis Sofianopoulos, Prokopis Prokopidis, Vassilis Papavasileiou, Athanasios Katsamanis, Stelios Piperidis, Vassilis Katsouros
cs.AI

Abstract

We describe the development and capabilities of Meltemi 7B, the first open Large Language Model for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40-billion-token Greek corpus. To develop Meltemi 7B, we adapt Mistral by continual pretraining on the Greek corpus. Meltemi 7B contains up-to-date information up to September 2023. Furthermore, we have translated and curated a Greek instruction corpus, which has been used for the instruction tuning of a chat model named Meltemi 7B Instruct. Special care has been given to alignment and the removal of toxic content for Meltemi 7B Instruct. The developed models are evaluated on a broad set of collected evaluation corpora, and examples of prompts and responses are presented. Both Meltemi 7B and Meltemi 7B Instruct are available at https://huggingface.co/ilsp under the Apache 2.0 license.
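Since the abstract points only to the ilsp organization page, the following is a minimal, hypothetical sketch of how the chat model might be loaded and queried with the Hugging Face transformers library. The repository id `ilsp/Meltemi-7B-Instruct-v1` is an assumption, not taken from the abstract; check the hub for the exact name and chat template.

```python
def build_chat(user_message: str) -> list:
    """Wrap a single user turn in the messages format expected by
    tokenizer.apply_chat_template()."""
    return [{"role": "user", "content": user_message}]


def load_and_generate(user_message: str,
                      model_id: str = "ilsp/Meltemi-7B-Instruct-v1") -> str:
    """Download the (assumed) Meltemi 7B Instruct checkpoint and answer one
    prompt. Requires the transformers and torch packages, plus several GB of
    disk space and GPU memory for a 7B-parameter model."""
    # Imported lazily: heavy dependency, only needed when actually generating.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Render the chat turn through the model's chat template and tokenize.
    inputs = tokenizer.apply_chat_template(
        build_chat(user_message), add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:],
                            skip_special_tokens=True)
```

A Greek prompt such as `load_and_generate("Ποια είναι η πρωτεύουσα της Ελλάδας;")` would then exercise the instruction-tuned model end to end.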
