Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
November 12, 2025
Authors: Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) has demonstrated great utility in the post-training of Large Language Models (LLMs). In GRPO, the model answers prompts and, through reinforcement learning, learns to prefer certain completions. Owing to its small communication volume, GRPO is inherently suitable for decentralised training: multiple nodes can answer prompts concurrently and then exchange the completions as strings. In this work, we present the first adversarial attack on decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens into benign models through both out-of-context and in-context attacks. Using empirical examples from math and coding tasks, we show that these attacks can easily poison benign nodes, polluting their local LLM post-training and achieving attack success rates of up to 100% in as few as 50 iterations. We propose two defenses, depending on whether all users train the same model or different models, and show that these defenses achieve attack stop rates of up to 100%, making the attack impossible.
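To make the exchange-and-poison mechanism concrete, the following minimal Python sketch simulates the string exchange the abstract describes: nodes answer a prompt, a malicious node appends arbitrary tokens to an otherwise valid completion, and the group-relative advantage computation that gives GRPO its name is applied to the reported rewards. All function names and the payload string are illustrative assumptions, not taken from the paper.

```python
import statistics

# Placeholder for the tokens an attacker injects; illustrative only.
MALICIOUS_TOKENS = "<malicious payload>"

def benign_node(prompt: str) -> tuple[str, float]:
    # A benign node samples a completion from its local model (stubbed here)
    # and scores it with the task reward, e.g. a math answer checker.
    return f"The answer to {prompt} is 4.", 1.0

def malicious_node(prompt: str) -> tuple[str, float]:
    # The attacker appends arbitrary tokens to an otherwise valid completion
    # and reports a high reward, so peers reinforce the poisoned string.
    return f"The answer to {prompt} is 4. {MALICIOUS_TOKENS}", 1.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalises each reward against its group: A_i = (r_i - mean) / std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    nodes = [benign_node, benign_node, malicious_node]
    group = [node("2+2") for node in nodes]  # completions exchanged as strings
    advantages = group_relative_advantages([r for _, r in group])
    for (completion, reward), adv in zip(group, advantages):
        # A benign node would now reinforce high-advantage completions in its
        # local policy update, including the poisoned one.
        print(f"A={adv:+.2f} r={reward:.1f} {completion!r}")
```

Because the poisoned completion carries the same reported reward as the benign ones, a node that trusts received strings will fold the injected tokens into its local post-training update.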
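The abstract does not spell out the two defenses. As one hedged illustration of what a screen in the same-model setting could look like, a receiving node can re-score each incoming completion with its own copy of the shared policy and drop strings containing implausible tokens. The toy scorer below (IN_DOMAIN, token_logprob, the threshold value) is entirely hypothetical and stands in for the shared LLM; it is not the paper's actual defense.

```python
import math

# Hypothetical likelihood screen for the same-model setting: every node holds
# the same policy, so it can re-score received strings before training on them.
IN_DOMAIN = {"The", "answer", "to", "2+2", "is", "4."}

def token_logprob(token: str) -> float:
    # Toy scorer: in-domain tokens are plausible, anything else near-impossible.
    return math.log(0.2) if token in IN_DOMAIN else math.log(1e-9)

def accept(completion: str, threshold: float = math.log(1e-4)) -> bool:
    # Reject the completion if any token is implausible under the shared model;
    # injected out-of-context tokens score far below the threshold.
    return all(token_logprob(t) >= threshold for t in completion.split())

if __name__ == "__main__":
    print(accept("The answer to 2+2 is 4."))                      # True: kept
    print(accept("The answer to 2+2 is 4. <malicious payload>"))  # False: dropped
```

Out-of-context injections are exactly the case such a filter catches: the appended tokens are unlikely under the shared policy, so the completion is discarded before it can pollute the local update.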