SCI-Verifier: Scientific Verifier with Thinking

Abstract

As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
