SCI-Verifier: Scientific Verifier with Thinking

Abstract

As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
