SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume that the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests that contain internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data are available at https://github.com/csml-rpi/SciConvBench.