SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.