SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
January 19, 2026
Authors: Tim Baumgärtner, Iryna Gurevych
cs.AI
Abstract
We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best-performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.
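To make the task concrete, here is a minimal, hypothetical illustration (not an instance from the SciCoQA dataset) of the kind of paper-code discrepancy the abstract describes: the paper states one hyperparameter value while the code implements another. The excerpts, the `extract_lr` helper, and the regex are all illustrative assumptions, not part of SciCoQA itself.

```python
import re

# Hypothetical paper-code discrepancy: the paper claims lr = 1e-4,
# but the code configures the optimizer with lr = 1e-3.
paper_excerpt = "We train all models with Adam at a learning rate of 1e-4."
code_excerpt = "optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)"

def extract_lr(text):
    """Pull the first learning-rate-like literal (e.g. '1e-4') from text."""
    match = re.search(r"\d+(?:\.\d+)?e-\d+", text)
    return float(match.group(0)) if match else None

paper_lr = extract_lr(paper_excerpt)
code_lr = extract_lr(code_excerpt)

# A discrepancy exists when both sides state a value and the values differ.
discrepancy = (
    paper_lr is not None and code_lr is not None and paper_lr != code_lr
)
print(f"paper lr={paper_lr}, code lr={code_lr}, discrepancy={discrepancy}")
```

A real detector must of course go far beyond such pattern matching, since many discrepancies (e.g., omitted paper details) have no single literal to compare; this sketch only shows the shape of the alignment check.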