

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

September 17, 2025
Authors: Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem
cs.AI

Abstract

We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ≈2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model, LFM2-1.2B, is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7-9B) categories, outperforming their base models. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
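
The abstract reports roughly 2× throughput from compressing the teacher to FP8 but does not describe the quantization scheme. As a rough illustration only, below is a minimal PyTorch sketch of per-tensor FP8 (e4m3) weight quantization; the scaling strategy and the round-trip error check are assumptions for illustration, not the paper's recipe.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Scale a weight tensor into the e4m3 range and cast it to FP8 (assumed scheme)."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from the FP8 weights."""
    return q.to(torch.float32) * scale

# Round-trip a toy weight matrix and inspect the quantization error.
w = torch.randn(1024, 1024)
q, s = quantize_fp8(w)
err = (dequantize_fp8(q, s) - w).abs().mean()
print(f"mean abs error: {err:.6f}")
```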
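
Similarly, slerp merging is named but not specified in detail. The following is a minimal NumPy sketch of spherical linear interpolation between a base and a fine-tuned parameter vector; the interpolation coefficient `t` and the flattened, per-tensor application are assumptions, not details from the report.

```python
import numpy as np

def slerp(w_base: np.ndarray, w_tuned: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two flattened weight vectors."""
    a = w_base / (np.linalg.norm(w_base) + eps)
    b = w_tuned / (np.linalg.norm(w_tuned) + eps)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between the two directions
    if np.sin(omega) < eps:  # nearly colinear: fall back to linear interpolation
        return (1.0 - t) * w_base + t * w_tuned
    return (np.sin((1.0 - t) * omega) * w_base + np.sin(t * omega) * w_tuned) / np.sin(omega)

# Toy usage: merge one flattened "parameter tensor" at t = 0.5.
rng = np.random.default_rng(0)
base, tuned = rng.normal(size=1024), rng.normal(size=1024)
merged = slerp(base, tuned, t=0.5)
print(merged.shape)  # (1024,)
```

Applied per parameter tensor and re-folded to the original shapes, this interpolates along the geodesic between the two checkpoints rather than averaging them linearly, which is the usual motivation for slerp when balancing a specialized model against its base.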