Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Abstract

Large language model (LLM) red-teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Red-teaming needs attacks that are both effective and diverse, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates the need to estimate the partition function Z in GFNs and thereby reduces training instability. S-GFN avoids Z estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer that prevents the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while preserving the optimal policy of GFN. We demonstrate that S-GFN achieves strong attack performance and diversity across various settings.
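
To illustrate why pairwise comparisons can remove the need to estimate Z, the following minimal sketch uses the standard trajectory-balance residual for autoregressive generation (where the backward policy contributes nothing) and takes the difference of two such residuals, so the shared log Z term cancels. This is a generic illustration under stated assumptions, not the exact S-GFN objective: the function names and the toy rewards are hypothetical, and the masking scheme and fluency stabilizer are omitted.

```python
import math

def tb_residual(log_pf, log_reward, log_z):
    # Standard trajectory-balance residual for autoregressive generation.
    # Assumption: the backward policy is deterministic, so its log-prob is 0.
    return log_z + log_pf - log_reward

def pairwise_residual(log_pf_a, log_r_a, log_pf_b, log_r_b):
    # Difference of two trajectory-balance residuals.
    # The log Z term appears in both and cancels, so it is never needed.
    return (log_pf_a - log_r_a) - (log_pf_b - log_r_b)

def pairwise_loss(log_pfs, log_rewards):
    # Mean squared pairwise residual over all unordered pairs in a batch.
    n = len(log_pfs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            r = pairwise_residual(log_pfs[i], log_rewards[i],
                                  log_pfs[j], log_rewards[j])
            total += r * r
            count += 1
    return total / count

# Toy example (hypothetical rewards for three terminal sequences).
rewards = [1.0, 2.0, 5.0]
z = sum(rewards)
log_rewards = [math.log(r) for r in rewards]

# A policy that samples each sequence in proportion to its reward.
optimal_log_pfs = [math.log(r / z) for r in rewards]
loss_at_optimum = pairwise_loss(optimal_log_pfs, log_rewards)

# A uniform policy does not match the reward distribution.
uniform_log_pfs = [math.log(1.0 / 3.0)] * 3
loss_uniform = pairwise_loss(uniform_log_pfs, log_rewards)
```

In this toy setting the pairwise loss vanishes (up to floating-point error) for the reward-proportional policy and is strictly positive for the uniform one, without Z ever being passed to the loss. This matches the claim that a Z-free objective can keep the same optimal policy as the original GFN objective.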
