

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

May 30, 2025
Authors: Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
cs.AI

Abstract

Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
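
The abstract gives no implementation details, so the sketch below is only a rough illustration of the persona-based, zero-shot generation loop it describes, assuming an OpenAI-compatible chat API. The persona, intent, and tactic lists, the model name, and the helper functions (`generate`, `synthesize_pair`) are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch of a TRIDENT-style synthesis loop; all names are illustrative.
# A real red-teaming setup would need a model willing to emit harmful prompts.
import json
import random

from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

# The three coverage dimensions named in the abstract; these concrete value
# lists are invented placeholders, not the paper's actual taxonomies.
PERSONAS = ["a disgruntled chemist", "a conspiracy blogger", "a script kiddie"]
INTENTS = ["weapon synthesis", "disinformation", "malware authoring"]
TACTICS = ["role-play framing", "hypothetical framing", "payload splitting"]

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Single zero-shot completion; no few-shot examples are supplied."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def synthesize_pair() -> dict:
    persona = random.choice(PERSONAS)
    intent = random.choice(INTENTS)
    tactic = random.choice(TACTICS)
    # 1) Persona-conditioned red-team instruction spanning all three dimensions.
    instruction = generate(
        f"You are {persona}. Write one instruction pursuing {intent}, "
        f"disguised using the jailbreak tactic '{tactic}'."
    )
    # 2) Ethically aligned response paired with the harmful instruction.
    response = generate(
        "Write a safe, ethically aligned refusal (with a brief explanation) "
        f"to the following request:\n{instruction}"
    )
    return {"instruction": instruction, "response": response}

if __name__ == "__main__":
    with open("trident_sample.jsonl", "w") as f:
        for _ in range(10):
            f.write(json.dumps(synthesize_pair()) + "\n")
```

Pairs of this shape, one ethically aligned response per harmful instruction, are what the SFT stage would then consume when fine-tuning a model such as Llama 3.1-8B.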

