D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

September 22, 2025
Authors: Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas
cs.AI

Abstract

The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.
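To make the data layout and the evaluation task described above concrete, here is a minimal sketch of how a single D-REX sample might be represented and how a deceptive-alignment detector could be scored against it. The field names (system_prompt, user_query, response, chain_of_thought, is_deceptive) and the keyword-matching baseline are illustrative assumptions, not the released schema or the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DRexSample:
    """One benchmark record, mirroring the four components named in the abstract.

    Field names are illustrative assumptions, not the released D-REX schema.
    """
    system_prompt: str     # adversarial system prompt crafted during red-teaming
    user_query: str        # end-user test query
    response: str          # model's seemingly innocuous final output
    chain_of_thought: str  # internal reasoning that may reveal malicious intent
    is_deceptive: bool     # label: does the hidden reasoning contradict the benign output?


def naive_detector(sample: DRexSample,
                   cues=("pretend", "hide", "without the user noticing")) -> bool:
    """Toy baseline: flag a sample as deceptive if the hidden chain-of-thought
    contains simple deception cues. A real detector would need far more than this."""
    cot = sample.chain_of_thought.lower()
    return any(cue in cot for cue in cues)


def evaluate(detector: Callable[[DRexSample], bool],
             samples: List[DRexSample]) -> float:
    """Accuracy of a deceptive-alignment detector over labeled samples."""
    if not samples:
        return 0.0
    correct = sum(detector(s) == s.is_deceptive for s in samples)
    return correct / len(samples)
```

Any practical detector would need to reason over the full chain-of-thought and its relation to the final response rather than match surface cues; the sketch only makes the (input, label) shape of the detection task concrete.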