Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

June 20, 2025
Authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins
cs.AI

Abstract

Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
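
The recipe above proceeds through four sequential stages, each starting from the previous stage's checkpoint. The sketch below only mirrors that ordering as described in the abstract; every function name, print message, and checkpoint label is an illustrative placeholder, not the authors' released code or any real training framework's API.

```python
"""Illustrative sketch of the staged Tower+ training recipe.

Each stage function is a stand-in: a real pipeline would invoke a
training framework with the curated translation, code, math, and
instruction-following datasets the abstract describes.
"""

def continued_pretraining(ckpt: str) -> str:
    # Stage 1: continued pretraining on curated monolingual and parallel text.
    print(f"[CPT]  adapting {ckpt} on multilingual/parallel corpora")
    return ckpt + "+cpt"

def supervised_finetuning(ckpt: str) -> str:
    # Stage 2: SFT on translation plus general tasks (code, math, chat).
    print(f"[SFT]  instruction tuning {ckpt}")
    return ckpt + "+sft"

def preference_optimization(ckpt: str) -> str:
    # Stage 3: align outputs using curated preference data.
    print(f"[PO]   preference optimization on {ckpt}")
    return ckpt + "+po"

def rl_verifiable_rewards(ckpt: str) -> str:
    # Stage 4: RL with automatically checkable rewards, e.g. unit tests
    # for generated code or exact-match answers for math problems.
    print(f"[RLVR] reinforcement learning on {ckpt}")
    return ckpt + "+rlvr"

if __name__ == "__main__":
    checkpoint = "base-multilingual-llm"  # hypothetical starting checkpoint
    for stage in (continued_pretraining, supervised_finetuning,
                  preference_optimization, rl_verifiable_rewards):
        checkpoint = stage(checkpoint)
    print("final:", checkpoint)
```

Note the consequence of the sequencing: because every stage consumes the previous stage's checkpoint, the data generated and curated early (e.g., during continued pretraining) shapes everything downstream, which is why the abstract stresses careful curation at each stage.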