SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.