FactReview: Evidence-Grounded Reviewing with Literature Positioning and Execution-Based Claim Verification

Abstract

Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most current LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. Their outputs are therefore sensitive to presentation quality, and they are weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies the major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded compute budgets to test the central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match the reported values for link prediction and node classification, yet it also shows that the paper's broader claim of cross-task performance is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper is 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool that gathers evidence and helps reviewers produce more evidence-grounded assessments. The code is available at https://github.com/DEFENSE-SEU/Review-Assistant.
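To make the five-label scheme concrete, the following is a minimal Python sketch of how per-task execution outcomes could be aggregated into a label for a cross-task performance claim. It is an illustration under stated assumptions, not FactReview's actual implementation: the names `Label`, `beats_baseline`, and `aggregate_claim` are hypothetical, and so is the aggregation rule itself. The abstract does not define when each label applies, so this toy rule never assigns "Supported by the paper". Only the MUTAG figures (88.4 and 92.6) come from the abstract. The link prediction and node classification outcomes are set to `True` as an assumption, because the abstract gives no numbers for them.

```python
from enum import Enum
from typing import Dict, Optional


class Label(Enum):
    """The five claim labels named in the abstract."""
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


def beats_baseline(reproduced: float, strongest_baseline: float) -> bool:
    """Hypothetical per-task check: higher is better, strict inequality."""
    return reproduced > strongest_baseline


def aggregate_claim(per_task: Dict[str, Optional[bool]]) -> Label:
    """Hypothetical rule for labeling a cross-task performance claim.

    per_task maps a task name to True (claim held when re-run),
    False (claim failed when re-run), or None (task was not executed).
    Tasks that were not executed are ignored by this toy rule.
    """
    executed = [held for held in per_task.values() if held is not None]
    if not executed:
        return Label.INCONCLUSIVE          # nothing was run
    if all(executed):
        return Label.SUPPORTED             # held on every executed task
    if any(executed):
        return Label.PARTIALLY_SUPPORTED   # held on some tasks only
    return Label.IN_CONFLICT               # failed on every executed task


outcomes = {
    # Assumption: treated as holding, since the abstract only says the
    # reproduction closely matched the reported values on these tasks.
    "link prediction": True,
    "node classification": True,
    # Figures from the abstract: reproduced 88.4 vs. strongest baseline 92.6.
    "graph classification (MUTAG)": beats_baseline(88.4, 92.6),
}

print(aggregate_claim(outcomes).value)  # Partially supported
```

Under these assumptions the rule reaches the same label as the case study: the claim holds on two tasks and fails on MUTAG, so it is only partially supported. A real system would also need to handle run-to-run variance and tolerance thresholds, which this sketch leaves out.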