Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
October 23, 2025
Authors: Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
cs.AI
Abstract
Recent advancements in large reasoning models (LRMs) have introduced an
intermediate "thinking" process prior to generating final answers, improving
their reasoning capabilities on complex downstream tasks. However, the
potential of LRMs as evaluators for machine translation (MT) quality remains
underexplored. We provide the first systematic analysis of LRM-as-a-judge in
MT evaluation. We identify key challenges, revealing that LRMs require tailored
evaluation materials, tend to "overthink" simpler instances, and have issues
with scoring mechanisms that lead to overestimation. To address these, we propose
to calibrate LRM thinking by training the models on synthetic, human-like thinking
trajectories. Our experiments on the WMT24 Metrics benchmarks demonstrate that this
approach reduces thinking budgets by roughly 35x while concurrently improving
evaluation performance across LRM scales ranging from 7B to 32B (e.g.,
R1-Distill-Qwen-7B achieves a +8.7 correlation-point improvement). These
findings highlight the potential of efficiently calibrated LRMs to advance
fine-grained automatic MT evaluation.
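To make the evaluation setup concrete, here is a minimal sketch of the LRM-as-a-judge
pipeline the abstract describes: a judge model assigns a quality score to each candidate
translation, and those scores are compared against human ratings with a pairwise
(Kendall-style) agreement statistic, the kind of correlation reported in WMT Metrics
benchmarks. The `judge_translation` function, the example segments, and the human scores
are hypothetical placeholders, not the authors' actual prompts, models, or data.

```python
from itertools import combinations

def judge_translation(source: str, hypothesis: str) -> float:
    """Placeholder for an LRM-as-a-judge call: in practice the model would emit
    an intermediate thinking trace followed by a numeric quality score (e.g. 0-100).
    A dummy heuristic is returned here so the example runs end to end."""
    return float(min(100, len(hypothesis)))  # illustrative stand-in, not a real metric

def kendall_pairwise(metric_scores, human_scores):
    """Pairwise agreement between metric and human rankings of segments:
    (concordant - discordant) / comparable pairs, ignoring ties for simplicity."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        if m1 == m2 or h1 == h2:
            continue
        if (m1 - m2) * (h1 - h2) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# Illustrative source/hypothesis pairs with made-up human quality ratings.
segments = [
    ("Der Hund schläft.", "The dog is sleeping."),
    ("Der Hund schläft.", "The dog sleeps loudly on the mat."),
    ("Der Hund schläft.", "Dog sleep."),
]
human = [95.0, 70.0, 30.0]
metric = [judge_translation(src, hyp) for src, hyp in segments]
print(f"pairwise correlation: {kendall_pairwise(metric, human):+.3f}")
```

A higher pairwise correlation means the judge's scores order translations more like human
annotators do; the "+8.7 correlation point" gain cited above refers to improvements on
agreement statistics of this kind.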