FactReview: Evidence-Grounded Reviewing with Literature Positioning and Execution-Based Claim Verification

Abstract

Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper stands at 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at https://github.com/DEFENSE-SEU/Review-Assistant.
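As a rough illustration of the evidence report described in the abstract, the Python sketch below maps a claim's per-task checks to one of the five labels. The decision rule and every name in it (`Label`, `Check`, `assign_label`) are assumptions made for illustration only; the abstract does not say how FactReview assigns labels, and the released repository may do so differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Label(Enum):
    # The five labels named in the abstract.
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


@dataclass
class Check:
    """Hypothetical record for one sub-result backing a claim."""
    task: str
    executed: bool  # True if the released code was actually run for this task
    holds: bool     # True if the available evidence agrees with the claim


def assign_label(checks: List[Check]) -> Label:
    """Assumed rule: execution evidence takes priority over the paper's own text."""
    executed = [c for c in checks if c.executed]
    if not executed:
        # No execution evidence: fall back on what the paper itself reports.
        if checks and all(c.holds for c in checks):
            return Label.SUPPORTED_BY_PAPER
        return Label.INCONCLUSIVE
    held = sum(c.holds for c in executed)
    if held == len(executed):
        return Label.SUPPORTED
    if held == 0:
        return Label.IN_CONFLICT
    return Label.PARTIALLY_SUPPORTED


# CompGCN case study, using only the figures stated in the abstract.
reproduced_mutag = 88.4    # reproduced graph-classification accuracy (%)
strongest_baseline = 92.6  # strongest baseline reported in the paper (%)

checks = [
    Check("link prediction", executed=True, holds=True),
    Check("node classification", executed=True, holds=True),
    Check("graph classification (MUTAG)", executed=True,
          holds=reproduced_mutag >= strongest_baseline),
]
print(assign_label(checks).value)  # Partially supported
```

Under this assumed rule, two reproduced tasks agree with the claim and one does not, so the broad cross-task performance claim lands on "Partially supported", the same label the case study reports.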