Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

Abstract

Group Relative Policy Optimization (GRPO) has demonstrated great utility in the post-training of Large Language Models (LLMs). In GRPO, the model answers prompts and, through reinforcement learning, learns which completions are preferred. Owing to its small communication volume, GRPO is inherently suitable for decentralised training: multiple nodes can answer the prompts concurrently and then exchange the completions in the form of strings. In this work, we present the first adversarial attack in decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens into benign models through both out-of-context and in-context attacks. Using empirical examples from math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training and achieving attack success rates of up to 100% in as few as 50 iterations. We propose two ways to defend against these attacks, depending on whether all users train the same model or different models. We show that these defenses can achieve stop rates of up to 100%, making the attack impossible.
