No Hidden Prompts Needed! AI Peer Review Can Be Manipulated Through Presentation-Only Revisions

Abstract

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also find that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as a stronger scientific contribution. These results show that the deployment risk lies not only in malicious hidden instructions but also in the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.
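
As an illustration only, the closed loop described in the abstract might be sketched as follows. The section names, the `review` and `rewrite` callables, the accept-only-if-the-score-improves rule, and the definition of attack success as any positive score change are all assumptions made for this sketch; the abstract does not specify them, and this is not the released attack framework.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical section names treated as presentation-level.
# Every other section is considered scientific evidence and must stay fixed.
PRESENTATION_SECTIONS = {"abstract", "contribution_framing", "related_work", "discussion"}

Paper = Dict[str, str]                           # section name -> text
Reviewer = Callable[[Paper], Tuple[float, str]]  # returns (score on a 10-point scale, feedback)
Rewriter = Callable[[Paper, str], Paper]         # proposes a revision from reviewer feedback


def evidence_unchanged(original: Paper, candidate: Paper) -> bool:
    """Return True if every non-presentation section is identical to the original."""
    if original.keys() != candidate.keys():
        return False
    return all(
        original[name] == candidate[name]
        for name in original
        if name not in PRESENTATION_SECTIONS
    )


def repackage(paper: Paper, review: Reviewer, rewrite: Rewriter,
              rounds: int = 3) -> Tuple[Paper, float, float]:
    """Search for presentation-only revisions that raise the reviewer's score.

    Returns (best paper found, original score, best score).
    """
    base_score, feedback = review(paper)
    best, best_score = paper, base_score
    for _ in range(rounds):
        candidate = rewrite(best, feedback)
        # Reject any candidate that touches the scientific evidence.
        if not evidence_unchanged(paper, candidate):
            continue
        score, new_feedback = review(candidate)
        # Assumed acceptance rule: keep a revision only if the score improves.
        if score > best_score:
            best, best_score, feedback = candidate, score, new_feedback
    return best, base_score, best_score


def summarize(score_pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Compute (attack success rate, mean score gain) from (before, after) pairs.

    Assumption: an attack counts as a success when the score strictly increases.
    """
    gains = [after - before for before, after in score_pairs]
    success_rate = sum(gain > 0 for gain in gains) / len(gains)
    mean_gain = sum(gains) / len(gains)
    return success_rate, mean_gain
```

In this sketch the `evidence_unchanged` check is what separates a presentation-only attack from ordinary revision: a candidate that alters any non-presentation section is discarded before it is ever scored.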