Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

Abstract

Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
