Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information
August 15, 2025
Authors: Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, this evaluation setup leaves a critical gap: a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when a problem lacks sufficient information, enabling proactivity in responding to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on this dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover LRM behaviors related to overthinking and hallucination, and highlight the potential and challenges of supervised fine-tuning in learning this ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than models that merely solve problems.