Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

November 12, 2025
Authors: Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
cs.AI

Abstract

Group Relative Policy Optimization (GRPO) has demonstrated great utility in the post-training of Large Language Models (LLMs). In GRPO, the model answers prompts and, through reinforcement learning, learns to prefer better completions. Owing to its small communication volume, GRPO is inherently suited to decentralised training: multiple nodes can answer prompts concurrently and then exchange the completions as plain strings. In this work, we present the first adversarial attack on decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens into benign models, via both out-of-context and in-context attacks. Using empirical examples from math and coding tasks, we show that these attacks easily poison benign nodes, polluting their local LLM post-training and achieving attack success rates of up to 100% in as few as 50 iterations. We propose two defenses, depending on whether all users train the same model or different models, and show that they achieve stop rates of up to 100%, rendering the attack ineffective.
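
For intuition, here is a minimal sketch of the mechanism the abstract describes: each node samples a group of completions for a prompt, scores them, and normalises rewards within the group; because completions travel between nodes as plain strings, that exchange is also the attack surface. The `model.generate` call and `reward_fn` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a decentralised GRPO step (illustrative only;
# `model.generate` and `reward_fn` are assumed, not the paper's code).
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalise each reward against its group's mean and std (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

def local_step(model, prompt, reward_fn, num_samples=8):
    # Each node samples a group of completions for the same prompt...
    completions = [model.generate(prompt) for _ in range(num_samples)]
    # ...scores them locally and computes group-relative advantages.
    rewards = [reward_fn(prompt, c) for c in completions]
    advantages = group_relative_advantages(rewards)
    # Completions are plain strings, so nodes can broadcast them cheaply.
    # This exchange is exactly where a malicious peer can inject
    # arbitrary tokens into the groups that benign nodes train on.
    return completions, advantages
```

A verifiable reward (exact-match checking for math, unit tests for code, as in the tasks the abstract evaluates) keeps `reward_fn` cheap enough for every node to run itself.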
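
The abstract does not spell out the defenses, only that they differ depending on whether all users train the same model. One plausible shape of such a check, sketched below under that assumption, is for each node to re-verify received completions with its own reward function before training on them; `local_reward_fn` and the tolerance threshold are hypothetical.

```python
def filter_received(prompt, received, local_reward_fn, tolerance=0.0):
    """Hypothetical defense sketch: keep only peer completions whose
    claimed quality the local node can reproduce with its own reward
    function (e.g. re-checking a math answer or re-running unit tests)."""
    kept = []
    for completion, claimed_reward in received:
        # Drop completions whose claimed reward cannot be verified locally.
        if abs(local_reward_fn(prompt, completion) - claimed_reward) <= tolerance:
            kept.append(completion)
    return kept
```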