Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information
August 15, 2025
Authors: Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, this evaluation setup leaves a critical gap: a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when a problem lacks sufficient information, enabling proactivity in responding to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on this dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover LRM behaviors related to overthinking and hallucination, and highlight the potential and challenges of supervised fine-tuning in learning this ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than models that merely solve problems.