No Hidden Prompts Needed! AI Peer Review Can Be Gamed with Presentation-Only Revisions

Abstract

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging, a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21 on a 10-point scale. Ordinary prose polishing does not explain the effect. We also find that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as a stronger scientific contribution. These results show that the deployment risk lies not only in malicious hidden instructions but also in the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.
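
To make the closed-loop idea and the two reported metrics concrete, the following is a minimal sketch and not the authors' framework. Everything named here is hypothetical: the `Paper` type, the greedy search in `closed_loop_search`, and `toy_reviewer`, which is a deterministic stand-in for a real AI reviewer. The sketch also assumes that an attack counts as a success when the score strictly increases; the abstract does not define the success criterion.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Paper:
    evidence: str      # methods, experiments, results: must never change
    presentation: str  # abstract, framing, related work, discussion


def closed_loop_search(
    paper: Paper,
    edits: Sequence[Callable[[str], str]],
    reviewer: Callable[[Paper], float],
    rounds: int = 3,
) -> Tuple[Paper, float]:
    """Greedy search over presentation-only edits, guided by reviewer scores."""
    best, best_score = paper, reviewer(paper)
    for _ in range(rounds):
        improved = False
        for edit in edits:
            candidate = replace(best, presentation=edit(best.presentation))
            # Invariant from the threat model: the evidence stays fixed.
            assert candidate.evidence == paper.evidence
            score = reviewer(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
        if not improved:
            break  # no edit helped in this round, so stop early
    return best, best_score


def summarize(before: List[float], after: List[float]) -> Tuple[float, float]:
    """Return (attack success rate, mean score gain).

    Assumption: an attack succeeds when the score strictly increases.
    """
    gains = [a - b for b, a in zip(before, after)]
    return sum(g > 0 for g in gains) / len(gains), sum(gains) / len(gains)


def toy_reviewer(p: Paper) -> float:
    """Hypothetical stand-in: rewards mentions of 'strength', ignores evidence."""
    return min(10.0, 4.0 + 0.5 * p.presentation.count("strength"))


paper = Paper(evidence="table-1: acc=0.80", presentation="We report results.")
edits = [
    lambda s: s + " A key strength is simplicity.",  # interpretation-level edit
    lambda s: s.upper(),                             # surface-level edit
]
final, final_score = closed_loop_search(paper, edits, toy_reviewer)
rate, gain = summarize([toy_reviewer(paper)], [final_score])
```

The toy reviewer is deliberately blind to `evidence`, which is the failure mode under test: a reviewer anchored to scientific content would return the same score for `paper` and `final`.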