

Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

August 15, 2025
Authors: Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei
cs.AI

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such an evaluation setup leaves a critical gap: a genuinely intelligent agent should not only solve problems (as a math quiz solver) but also be able to ask for information when a problem lacks sufficient information, enabling proactivity in responding to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on this dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such an ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than mere problem-solving skill.