
Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

October 23, 2025
Authors: Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
cs.AI

Abstract

Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators of machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation and identify three key challenges: LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have scoring mechanisms prone to overestimation. To address these issues, we propose calibrating LRM thinking by training the models on synthetic, human-like thinking trajectories. Our experiments on the WMT24 Metrics benchmark demonstrate that this approach reduces thinking budgets by roughly 35x while concurrently improving evaluation performance across LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
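
To make the LRM-as-a-judge setup concrete, below is a minimal sketch of scoring a single translation with a reasoning model served through Hugging Face transformers. The checkpoint name, prompt wording, 0-100 scoring scale, generation budget, and score-parsing regex are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Minimal LRM-as-a-judge sketch for MT quality scoring (illustrative only).
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint

def build_judge_prompt(source: str, translation: str) -> str:
    """Ask the model to reason about translation errors, then emit a score."""
    return (
        "You are a translation quality evaluator.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Identify any accuracy or fluency errors, then output a final line "
        "of the form 'Score: <0-100>'."
    )

def extract_score(generation: str) -> float | None:
    """Parse the last 'Score: X' line; LRMs emit thinking text before it."""
    matches = re.findall(r"Score:\s*(\d+(?:\.\d+)?)", generation)
    return float(matches[-1]) if matches else None

def judge(source: str, translation: str, model, tokenizer) -> float | None:
    messages = [{"role": "user", "content": build_judge_prompt(source, translation)}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # A large max_new_tokens budget accommodates the intermediate "thinking"
    # process; the calibration described above aims to shrink this budget.
    output = model.generate(inputs, max_new_tokens=2048, do_sample=False)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    return extract_score(text)

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
    print(judge("Der Hund schläft.", "The dog is sleeping.", model, tokenizer))
```

In this sketch, the overthinking and overestimation issues discussed in the paper would surface as unnecessarily long generations before the final "Score:" line and as inflated parsed scores; calibration via synthetic human-like thinking trajectories targets exactly that generation step.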