

DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

April 10, 2025
Authors: Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
cs.AI

Abstract

Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains underexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3-mini) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on the WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while the OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning-token usage correlates positively with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.
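
To make the setup concrete, below is a minimal sketch of the LLM-as-a-judge protocol the abstract describes, on the o3-mini side, assuming the OpenAI Python SDK and SciPy. The direct-assessment prompt, the 0-100 scale, and the helper names (`judge`, `segment_correlation`) are illustrative assumptions, not the paper's exact protocol; varying `reasoning_effort` and reading back the reasoning-token count is how one would reproduce the abstract's correlation analysis in spirit.

```python
# Minimal sketch (not the paper's exact protocol) of LLM-as-a-judge
# MT evaluation with an OpenAI o-series model. Assumes the OpenAI
# Python SDK (>= 1.x) and SciPy; the prompt wording, the 0-100
# direct-assessment scale, and all helper names are illustrative.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Score the following translation from {src_lang} to {tgt_lang} on a "
    "scale from 0 (worst) to 100 (best). Reply with the number only.\n\n"
    "Source: {source}\nTranslation: {hypothesis}"
)

def judge(source, hypothesis, src_lang, tgt_lang, effort="high"):
    """Ask the model for a direct-assessment score.

    Returns (score, reasoning_tokens); the reasoning-token count is
    the quantity the abstract correlates with evaluation quality.
    """
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,  # "low" | "medium" | "high"
        messages=[{
            "role": "user",
            "content": PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                      source=source, hypothesis=hypothesis),
        }],
    )
    score = float(resp.choices[0].message.content.strip())
    reasoning_tokens = resp.usage.completion_tokens_details.reasoning_tokens
    return score, reasoning_tokens

def segment_correlation(model_scores, human_scores):
    """Evaluation quality as segment-level Spearman correlation between
    model scores and human judgments (e.g., WMT23 human assessments)."""
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

In this framing, "evaluation quality" is the correlation of the judge's scores with human judgments, which is what `segment_correlation` computes; repeating the loop at each `reasoning_effort` level and comparing correlations against mean reasoning-token usage mirrors the abstract's finding for o3-mini.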

