RefineBench: Evaluating Refinement Capability of Language Models via Checklists
November 27, 2025
Authors: Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi
cs.AI
Abstract
Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini 2.5 Pro gains only +1.8%, while DeepSeek-R1 declines by 0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.
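The abstract does not spell out how the checklist-based evaluation or the two refinement modes are implemented. The following is a minimal Python sketch, assuming hypothetical `generate` (the model under evaluation) and `judge` (a per-checklist-item verifier, e.g. an LLM judge) callables, of how guided refinement and self-refinement over up to five turns could be scored. All names and signatures are illustrative, not RefineBench's actual API.

```python
from typing import Callable, List


def checklist_score(response: str, checklist: List[str],
                    judge: Callable[[str, str], bool]) -> float:
    """Fraction of checklist items the judge marks as satisfied (0.0 to 1.0)."""
    if not checklist:
        return 0.0
    return sum(judge(response, item) for item in checklist) / len(checklist)


def refine_loop(problem: str, checklist: List[str],
                generate: Callable[[str], str],
                judge: Callable[[str, str], bool],
                guided: bool, max_turns: int = 5) -> List[float]:
    """Initial attempt plus up to `max_turns` refinement turns.

    guided=True  -> feedback names the unmet checklist items (guided refinement)
    guided=False -> only a generic request to improve (self-refinement)
    Returns the checklist score after each turn.
    """
    response = generate(problem)
    scores = [checklist_score(response, checklist, judge)]
    for _ in range(max_turns):
        if scores[-1] == 1.0:  # every checklist item already satisfied
            break
        if guided:
            unmet = [c for c in checklist if not judge(response, c)]
            feedback = "Revise your answer; it does not yet satisfy: " + "; ".join(unmet)
        else:
            feedback = "Review your previous answer and improve it if you find any problems."
        prompt = f"{problem}\n\nPrevious answer:\n{response}\n\n{feedback}"
        response = generate(prompt)
        scores.append(checklist_score(response, checklist, judge))
    return scores


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end: the "model" echoes the tail
    # of its prompt, and the "judge" checks for a keyword from each item.
    def toy_generate(prompt: str) -> str:
        return prompt[-80:]

    def toy_judge(response: str, item: str) -> bool:
        return item.split()[-1] in response

    checklist = ["mentions accuracy", "mentions latency"]
    print(refine_loop("Compare two systems.", checklist,
                      toy_generate, toy_judge, guided=True))
```

In this framing, the per-turn score is simply the fraction of checklist items satisfied; the guided and self-refinement conditions differ only in whether the feedback message exposes the unmet items, which mirrors the contrast the abstract draws between near-perfect guided refinement and stagnant self-refinement.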