TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
May 30, 2025
Authors: Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
cs.AI
Abstract
Large Language Models (LLMs) excel in various natural language processing
tasks but remain vulnerable to generating harmful content or being exploited
for malicious purposes. Although safety alignment datasets have been introduced
to mitigate such risks through supervised fine-tuning (SFT), these datasets
often lack comprehensive risk coverage. Most existing datasets focus primarily
on lexical diversity while neglecting other critical dimensions. To address
this limitation, we propose a novel analysis framework to systematically
measure the risk coverage of alignment datasets across three essential
dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We
further introduce TRIDENT, an automated pipeline that leverages persona-based,
zero-shot LLM generation to produce diverse and comprehensive instructions
spanning these dimensions. Each harmful instruction is paired with an ethically
aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311
examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on
TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29%
reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to
the best-performing baseline model fine-tuned on the WildBreak dataset.
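To make the tri-dimensional generation idea concrete, the following is a minimal sketch of how persona-based, zero-shot prompt construction could seed an LLM with all three coverage dimensions at once. The persona/intent/tactic lists, the template, and the function names are illustrative assumptions, not the paper's actual pipeline.

```python
import itertools
import random

# Hypothetical seed lists for the three coverage dimensions
# (lexical diversity via personas, malicious intent, jailbreak tactics).
# These values are assumptions for illustration only.
PERSONAS = ["a disgruntled chemist", "a conspiracy blogger", "a teenage hacker"]
INTENTS = ["weapon synthesis", "disinformation", "privacy violation"]
TACTICS = ["role-play framing", "hypothetical scenario", "payload splitting"]

def build_generation_prompt(persona: str, intent: str, tactic: str) -> str:
    """Compose one zero-shot prompt asking a generator LLM to write a
    harmful instruction that covers the given three dimensions."""
    return (
        f"Adopt the persona of {persona}. Write a single instruction "
        f"whose goal is {intent}, phrased using the '{tactic}' jailbreak tactic."
    )

def sample_prompts(n: int, seed: int = 0) -> list[str]:
    """Sample n distinct (persona, intent, tactic) combinations so the
    generated instructions spread across all three dimensions."""
    rng = random.Random(seed)
    combos = list(itertools.product(PERSONAS, INTENTS, TACTICS))
    return [build_generation_prompt(*combo) for combo in rng.sample(combos, n)]
```

In a full pipeline, each sampled prompt would be sent to a generator model, and every resulting harmful instruction would then be paired with an ethically aligned refusal to form the SFT examples; the sketch above only covers the combinatorial seeding step.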