SCICONVBENCH: A Large Language Model Benchmark for Multi-Turn Clarification Tasks in Computational Science

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.