FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
April 7, 2026
Authors: Hang Xu, Ling Yue, Chaoqian Ouyang, Yuchen Liu, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang
cs.AI
Abstract
Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at https://github.com/DEFENSE-SEU/Review-Assistant.
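The five evidence labels and the per-claim verdicts described above can be sketched as a small data model. This is an illustrative sketch only; the class and field names (`EvidenceLabel`, `ClaimVerdict`, `claim`, `label`, `evidence`) are assumptions for exposition, not the API of the released Review-Assistant repository. Only the label strings and the MUTAG figures come from the abstract.

```python
from enum import Enum
from dataclasses import dataclass


class EvidenceLabel(Enum):
    """The five evidence labels FactReview assigns to each major claim."""
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


@dataclass
class ClaimVerdict:
    """One entry in the evidence report: a claim, its label, and the evidence."""
    claim: str
    label: EvidenceLabel
    evidence: str


# The CompGCN case-study finding from the abstract, expressed as a verdict record
# (the claim wording here is paraphrased for illustration):
verdict = ClaimVerdict(
    claim="CompGCN's performance advantage holds across tasks",
    label=EvidenceLabel.PARTIALLY_SUPPORTED,
    evidence="Reproduced 88.4% on MUTAG graph classification, below the "
             "strongest baseline reported in the paper (92.6%).",
)
print(verdict.label.value)  # Partially supported
```

Modeling the labels as an explicit enum, rather than free-form strings, would keep the report's vocabulary closed and machine-checkable, which matters if downstream tooling aggregates verdicts across submissions.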