Toward Automating Scientific Review: Google's Paper Assistant Tool

Abstract

Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy of four progressive levels of AI-human collaboration in scientific evaluation and discuss the trade-offs involved at each level. As a step toward this future, we introduce the Paper Assistant Tool (PAT), an agentic AI framework built for deep scientific review and verification. PAT ingests full scientific manuscripts and produces a comprehensive evaluation: it checks theoretical results, validates experiments, suggests improvements, and identifies potential flaws. By using inference scaling techniques, PAT can identify deeper issues than a single model call can find on its own, achieving a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark. Pilot deployments of PAT as a pre-submission tool for authors at two major computer science conferences (STOC and ICML) demonstrate its ability to identify critical errors and suggest substantive improvements to research papers. By catching errors early, PAT eases the cognitive burden placed on referees while preserving their control over the outcomes of the review process.
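
To make the recall claim concrete, the following is a minimal sketch of how recall on a set of known errors, and an improvement over a zero-shot baseline, might be computed. The abstract does not say whether the 34% figure is an absolute or a relative gain, so the sketch reports both. The function names and the toy error identifiers are hypothetical illustrations; none of the numbers come from SPOT or from the paper.

```python
def recall(found: set, true_errors: set) -> float:
    """Fraction of the known errors that the reviewer flagged."""
    if not true_errors:
        raise ValueError("true_errors must be non-empty")
    return len(found & true_errors) / len(true_errors)


def improvement(baseline: float, system: float) -> tuple:
    """Return (absolute gain in recall, relative gain over the baseline)."""
    if baseline <= 0:
        raise ValueError("baseline recall must be positive")
    gain = system - baseline
    return gain, gain / baseline


# Hypothetical toy data (assumption): five annotated errors in one paper.
true_errors = {"e1", "e2", "e3", "e4", "e5"}
zero_shot_found = {"e1", "e2", "x9"}     # "x9" is a false positive; it does not affect recall
scaled_found = {"e1", "e2", "e3", "e4"}  # output of a hypothetical inference-scaled run

r0 = recall(zero_shot_found, true_errors)
r1 = recall(scaled_found, true_errors)
abs_gain, rel_gain = improvement(r0, r1)

print(f"zero-shot recall = {r0:.2f}, scaled recall = {r1:.2f}")
print(f"absolute gain = {abs_gain:.2f}, relative gain = {rel_gain:.0%}")
```

Recall ignores false positives by construction, so a tool evaluated this way would normally also be checked for precision. The abstract reports only the recall figure.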