No Hidden Prompts Needed! Presentation-Only Edits Can Manipulate AI Peer Review

Abstract

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging, a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also find that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as a stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but also the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.
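
The closed-loop search and the two reported metrics can be illustrated with a minimal Python sketch. This is not the authors' framework: the `Paper` split into `evidence` and `presentation`, the `Reviewer` and `Strategy` callables, the greedy acceptance rule, and the definition of attack success as any strict score increase are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Paper:
    """Hypothetical paper model: evidence is frozen, presentation is editable."""
    evidence: str      # methods, experiments, figures, equations, proofs, numbers
    presentation: str  # abstract, framing, related work, discussion, narrative


# Hypothetical interfaces (assumptions, not part of the original work).
Reviewer = Callable[[Paper], Tuple[float, str]]  # returns (score, feedback text)
Strategy = Callable[[Paper, str], str]           # returns revised presentation text


def adversarial_repackaging(
    paper: Paper,
    reviewer: Reviewer,
    strategies: Dict[str, Strategy],
    rounds: int = 3,
) -> Tuple[Paper, float]:
    """Greedy closed-loop search over presentation-only revisions."""
    best = paper
    best_score, feedback = reviewer(paper)
    for _ in range(rounds):
        improved = False
        for strategy in strategies.values():
            # Only the presentation changes; the evidence is copied verbatim.
            candidate = Paper(
                evidence=paper.evidence,
                presentation=strategy(best, feedback),
            )
            score, candidate_feedback = reviewer(candidate)
            if score > best_score:
                best, best_score, feedback = candidate, score, candidate_feedback
                improved = True
        if not improved:
            break  # no strategy raised the score in this round
    return best, best_score


def attack_metrics(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (attack success rate, mean score gain) over (before, after) pairs.

    Assumption: an attack counts as successful when the score strictly rises.
    """
    successes = sum(1 for before, after in pairs if after > before)
    mean_gain = sum(after - before for before, after in pairs) / len(pairs)
    return successes / len(pairs), mean_gain
```

In this sketch the reviewer's feedback from the best candidate so far is passed to every strategy, which is what makes the loop closed. Because `evidence` is always copied from the original paper, any score change can only come from the presentation text.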