Test-Time Scaling of Reasoning Models for Machine Translation
October 7, 2025
Authors: Zihao Li, Shaoxiong Ji, Jörg Tiedemann
cs.AI
Abstract
Test-time scaling (TTS) has enhanced the performance of Reasoning Models
(RMs) on various tasks such as math and coding, yet its efficacy in machine
translation (MT) remains underexplored. This paper investigates whether
increased inference-time computation improves translation quality. We evaluate
12 RMs across a diverse suite of MT benchmarks spanning multiple domains,
examining three scenarios: direct translation, forced-reasoning extrapolation,
and post-editing. Our findings show that for general-purpose RMs, TTS provides
limited and inconsistent benefits for direct translation, with performance
quickly plateauing. However, the effectiveness of TTS is unlocked by
domain-specific fine-tuning, which aligns a model's reasoning process with task
requirements, leading to consistent improvements up to an optimal,
self-determined reasoning depth. We also find that forcing a model to reason
beyond its natural stopping point consistently degrades translation quality. In
contrast, TTS proves highly effective in a post-editing context, reliably
turning self-correction into a beneficial process. These results indicate that
the value of inference-time computation in MT lies not in enhancing single-pass
translation with general models, but in targeted applications like multi-step,
self-correction workflows and in conjunction with task-specialized models.
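The post-editing scenario the abstract highlights can be pictured as a simple two-stage loop: produce a draft translation, then iterate self-correction passes until the model stops changing its output. The sketch below is illustrative only; `call_model` is a hypothetical stub standing in for a query to a reasoning model, and the stopping rule mirrors the paper's finding that pushing reasoning past a model's natural stopping point degrades quality.

```python
# A minimal sketch of a translate-then-post-edit workflow, the setting in
# which the paper finds test-time scaling most reliable. `call_model` is a
# hypothetical placeholder; a real system would query a reasoning model here.

def call_model(prompt: str) -> str:
    # Hypothetical stub so the sketch runs end to end. A real implementation
    # would send the prompt to an LLM and return its completion.
    if prompt.startswith("Translate"):
        return "Draft translation"
    return "Refined translation"

def translate_with_post_editing(source: str, max_rounds: int = 3) -> str:
    """Direct translation followed by up to `max_rounds` self-correction
    passes, stopping early once the model no longer revises its output
    (its natural stopping point)."""
    draft = call_model(f"Translate into English: {source}")
    for _ in range(max_rounds):
        revised = call_model(
            f"Source: {source}\nDraft: {draft}\n"
            "Review the draft and output an improved translation."
        )
        if revised == draft:  # no further changes: stop rather than force more reasoning
            break
        draft = revised
    return draft
```

With the stub above, the loop converges after one revision pass; the design point is that extra compute is spent on discrete correction steps rather than on extending a single reasoning chain.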