Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
June 20, 2025
Authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins
cs.AI
Abstract
Fine-tuning pretrained LLMs has been shown to be an effective strategy for
reaching state-of-the-art performance on specific tasks like machine
translation. However, this adaptation often comes at the cost of
general-purpose capabilities, such as conversational reasoning and
instruction-following, hampering the system's utility in real-world
applications that require a mixture of skills. In this paper, we introduce
Tower+, a suite of models designed to deliver strong performance across both
translation and multilingual general-purpose text capabilities. We achieve a
Pareto frontier between translation specialization and multilingual
general-purpose capabilities by introducing a novel training recipe that builds
on Tower (Alves et al., 2024), comprising continued pretraining, supervised
fine-tuning, preference optimization, and reinforcement learning with
verifiable rewards. At each stage of training, we carefully generate and curate
data to strengthen performance on translation as well as general-purpose tasks
involving code generation, mathematics problem solving, and general
instruction-following. We develop models at multiple scales: 2B, 9B, and 72B.
Our smaller models often outperform larger general-purpose open-weight and
proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers
best-in-class translation performance for high-resource languages and top
results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we
introduce for evaluating both translation and instruction-following. Our
findings highlight that it is possible to rival frontier models in general
capabilities, while optimizing for specific business domains, such as
translation and localization.
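
As a minimal illustration of the "reinforcement learning with verifiable rewards" stage named in the abstract (and of the kind of joint translation and instruction-following behaviour that IF-MT probes), the sketch below shows a toy rule-based reward that can be checked programmatically: it scores 1.0 only when the output keeps every source placeholder intact and respects an optional length constraint. This is not the paper's actual reward; the function name, placeholder pattern, and checks are illustrative assumptions.

```python
import re

# Hypothetical placeholder pattern, e.g. "{user_name}" in localization strings.
PLACEHOLDER = re.compile(r"\{[A-Za-z_]\w*\}")

def verifiable_reward(source: str, translation: str, max_words: int | None = None) -> float:
    """Toy verifiable reward for translation with instruction constraints.

    Returns 1.0 only if every check passes, otherwise 0.0:
      - the output is non-empty and not a verbatim copy of the source
      - all placeholders from the source appear unchanged in the output
      - an optional word-count limit (e.g. from the user instruction) is met
    """
    if not translation.strip() or translation.strip() == source.strip():
        return 0.0
    if set(PLACEHOLDER.findall(source)) != set(PLACEHOLDER.findall(translation)):
        return 0.0
    if max_words is not None and len(translation.split()) > max_words:
        return 0.0
    return 1.0

if __name__ == "__main__":
    src = "Hello {user_name}, your order {order_id} has shipped."
    hyp = "Olá {user_name}, a sua encomenda {order_id} foi enviada."
    print(verifiable_reward(src, hyp, max_words=12))                      # 1.0: all checks pass
    print(verifiable_reward(src, "Olá, a sua encomenda foi enviada."))    # 0.0: placeholders dropped
```

In an RL setup such rule-based signals are cheap to compute and hard to game, which is why verifiable rewards pair well with preference optimization for tasks like translation and localization where some constraints (placeholders, length, output language) can be checked exactly.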