
Test-Time Scaling of Reasoning Models for Machine Translation

October 7, 2025
Authors: Zihao Li, Shaoxiong Ji, Jörg Tiedemann
cs.AI

Abstract

Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model's reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.
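To make the post-editing scenario concrete, here is a minimal illustrative sketch of a multi-step self-correction workflow of the kind the abstract describes. This is an assumption-laden sketch, not the authors' implementation: the `refine` callable, the `max_rounds` budget knob, and the stub refiner are all hypothetical placeholders standing in for calls to an actual reasoning model.

```python
# Illustrative sketch of a test-time-scaled post-editing loop (hypothetical;
# not the paper's released code). A reasoning model first produces a draft
# translation, then spends extra inference-time compute revising its own output.

from typing import Callable


def postedit_loop(
    source: str,
    draft: str,
    refine: Callable[[str, str], str],  # hypothetical RM call: (source, current) -> revised
    max_rounds: int = 3,                # the test-time scaling knob: more rounds = more compute
) -> str:
    """Iteratively ask a reasoning model to revise its own translation."""
    current = draft
    for _ in range(max_rounds):
        revised = refine(source, current)
        if revised.strip() == current.strip():
            # The model sees nothing left to fix; stop rather than force more reasoning.
            break
        current = revised
    return current


if __name__ == "__main__":
    # Stub refiner so the sketch runs standalone; a real setup would prompt a
    # reasoning model to critique and rewrite `current` given `source`.
    def dummy_refine(source: str, current: str) -> str:
        return current  # no-op placeholder

    print(postedit_loop("Bonjour le monde.", "Hello, world.", dummy_refine))
```

The early-exit check mirrors the paper's finding that pushing a model past its natural stopping point tends to hurt quality, so extra compute is spent only while the model still proposes changes.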