SCICONVBENCH: Benchmarking Large Language Models on Multi-Turn Clarification for Task Formulation in Computational Science

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.