SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
May 18, 2026
Authors: Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan
cs.AI
Abstract
Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.