SCI-Verifier: Scientific Verifier with Thinking

Abstract

As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
