Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
October 23, 2025
Authors: Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
cs.AI
Abstract
Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators of machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation, identifying three key challenges: LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have scoring mechanisms that lead to overestimation. To address these issues, we propose to calibrate LRM thinking by training the models on synthetic, human-like thinking trajectories. Experiments on the WMT24 Metrics benchmark demonstrate that this approach reduces thinking budgets by roughly 35x while concurrently improving evaluation performance across LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B gains +8.7 correlation points). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
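To make the LRM-as-a-judge setup concrete, the sketch below shows one way such an evaluation loop might look: the reasoning model is prompted to think and then emit a quality score for each source–translation pair, and the predicted scores are correlated with human judgments, as is standard in WMT Metrics evaluations. This is a minimal illustration under stated assumptions, not the authors' protocol; the prompt wording, the 0–100 scoring scale, and the `query_lrm` helper are all hypothetical placeholders.

```python
# Minimal, illustrative sketch of LRM-as-a-judge MT evaluation
# (assumed prompt format and scoring scale; not the paper's exact setup).
from scipy.stats import pearsonr


def build_judge_prompt(source: str, translation: str) -> str:
    """Ask the reasoning model to think, then emit a quality score (assumed 0-100 scale)."""
    return (
        "Evaluate the quality of the candidate translation below.\n"
        f"Source: {source}\n"
        f"Candidate: {translation}\n"
        "Think step by step, then output a final line 'Score: <0-100>'."
    )


def parse_score(model_output: str) -> float:
    """Extract the numeric score following the last 'Score:' marker in the output."""
    return float(model_output.rsplit("Score:", 1)[-1].strip().split()[0])


def evaluate_judge(pairs, human_scores, query_lrm):
    """Score each (source, translation) pair with the LRM judge and
    correlate the predicted scores with human judgments.

    `query_lrm` is a hypothetical callable that sends a prompt to a
    reasoning model (e.g., R1-Distill-Qwen-7B) and returns its text output.
    """
    predicted = [
        parse_score(query_lrm(build_judge_prompt(src, hyp))) for src, hyp in pairs
    ]
    correlation, _ = pearsonr(predicted, human_scores)  # segment-level Pearson correlation
    return correlation
```

Segment-level correlation with human scores (e.g., Pearson or Kendall's tau) is the usual way metric quality is reported in WMT Metrics shared tasks, which is the sense in which the abstract's "+8.7 correlation point" improvement should be read.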