StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors
February 9, 2026
Authors: Suraj Ranganath, Atharv Ramesh
cs.AI
Abstract
AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, FastDetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
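As a rough illustration of the training signal the abstract describes (a composite reward balancing multi-detector evasion against semantic preservation, optimized with GRPO's group-relative advantages), the sketch below assumes generic detector and similarity callables and an illustrative weighting `alpha`; none of these function names, weights, or implementation details are taken from the paper or its repository.

```python
# Hedged sketch, not the authors' implementation: a composite paraphrase reward
# of the kind the abstract describes, plus GRPO-style group-relative advantages.
from typing import Callable, Sequence

def composite_reward(
    paraphrase: str,
    original: str,
    detector_scores: Callable[[str], Sequence[float]],  # assumed: each score in [0, 1], higher = "AI"
    semantic_sim: Callable[[str, str], float],           # assumed: similarity in [0, 1], e.g. embedding cosine
    alpha: float = 0.5,                                   # illustrative weighting, not from the paper
) -> float:
    """Reward = alpha * (1 - mean detector score) + (1 - alpha) * semantic similarity."""
    scores = detector_scores(paraphrase)
    evasion = 1.0 - sum(scores) / len(scores)          # low ensemble score -> high evasion reward
    preservation = semantic_sim(paraphrase, original)  # penalizes meaning drift
    return alpha * evasion + (1.0 - alpha) * preservation

def group_relative_advantages(rewards: Sequence[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within a group of K
    paraphrases sampled for the same input text."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std + 1e-8) for r in rewards]
```

The released code and evaluation pipeline at the linked repository are the authoritative reference for the actual reward shaping, detector ensemble, and TPR@1%FPR / bootstrap-AUROC evaluation reported above.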