Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

June 20, 2025
Authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins
cs.AI

Abstract

Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
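
The recipe above proceeds through four sequential stages, each starting from the previous stage's checkpoint. The sketch below only mirrors that ordering as described in the abstract; every function name, print message, and checkpoint label is an illustrative placeholder, not the authors' released code or any real training framework's API.

```python
"""Illustrative sketch of the staged Tower+ training recipe.

Each stage function is a stand-in: a real pipeline would invoke a
training framework with the curated translation, code, math, and
instruction-following datasets the abstract describes.
"""

def continued_pretraining(ckpt: str) -> str:
    # Stage 1: continued pretraining on curated monolingual and parallel text.
    print(f"[CPT]  adapting {ckpt} on multilingual/parallel corpora")
    return ckpt + "+cpt"

def supervised_finetuning(ckpt: str) -> str:
    # Stage 2: SFT on translation plus general tasks (code, math, chat).
    print(f"[SFT]  instruction tuning {ckpt}")
    return ckpt + "+sft"

def preference_optimization(ckpt: str) -> str:
    # Stage 3: align outputs using curated preference data.
    print(f"[PO]   preference optimization on {ckpt}")
    return ckpt + "+po"

def rl_verifiable_rewards(ckpt: str) -> str:
    # Stage 4: RL with automatically checkable rewards, e.g. unit tests
    # for generated code or exact-match answers for math problems.
    print(f"[RLVR] reinforcement learning on {ckpt}")
    return ckpt + "+rlvr"

if __name__ == "__main__":
    checkpoint = "base-multilingual-llm"  # hypothetical starting checkpoint
    for stage in (continued_pretraining, supervised_finetuning,
                  preference_optimization, rl_verifiable_rewards):
        checkpoint = stage(checkpoint)
    print("final:", checkpoint)
```

Note the consequence of the sequencing: because every stage consumes the previous stage's checkpoint, the data generated and curated early (e.g., during continued pretraining) shapes everything downstream, which is why the abstract stresses careful curation at each stage.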