ChatPaper.ai

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

April 7, 2026
Authors: Hang Xu, Ling Yue, Chaoqian Ouyang, Yuchen Liu, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang
cs.AI

Abstract

Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at https://github.com/DEFENSE-SEU/Review-Assistant.
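The five-label scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; `label_claim`, its thresholds, and its inputs are hypothetical assumptions, showing only how a reproduced metric might be compared against a reported value and the strongest reported baseline to pick a label (the "Inconclusive" label, which covers claims that cannot be tested numerically, is omitted here).

```python
# Hypothetical sketch (not FactReview's actual code): assign one of the
# evidence labels to a numeric empirical claim by comparing a reproduced
# metric with the value reported in the submission. The tolerance is an
# illustrative assumption.

def label_claim(reported, reproduced=None, tolerance=0.5, baseline=None):
    """Label a single numeric claim.

    reported   -- metric value claimed in the paper (e.g. accuracy in %)
    reproduced -- value obtained by re-running the released code, or None
                  if no code could be executed
    tolerance  -- max absolute gap (in metric points) to count as a match
    baseline   -- strongest competing baseline reported in the paper, when
                  the claim asserts superiority across tasks
    """
    if reproduced is None:
        # No executable evidence: the claim rests on the paper's own text.
        return "Supported by the paper"
    if abs(reproduced - reported) > tolerance:
        # Re-execution contradicts the reported number.
        return "In conflict"
    if baseline is not None and reproduced < baseline:
        # The reproduction matches the paper, but the broader superiority
        # claim does not hold against the strongest reported baseline.
        return "Partially supported"
    return "Supported"

# Mirrors the CompGCN case study from the abstract: the MUTAG reproduction
# (88.4%) matches the reported value, but the strongest reported baseline
# (92.6%) still wins, so the cross-task claim is only partially supported.
print(label_claim(reported=88.4, reproduced=88.4, baseline=92.6))
# → Partially supported
```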
April 9, 2026