D-REX：大型語言模型欺騙性推理檢測基準

摘要

大型語言模型（LLMs）的安全性和對齊性對於其負責任的部署至關重要。目前的評估方法主要集中於識別和防止明顯有害的輸出。然而，這些方法往往未能解決一種更為隱蔽的故障模式：模型在內部進行惡意或欺騙性推理時，卻產生了看似無害的輸出。這種漏洞通常由複雜的系統提示注入觸發，使模型能夠繞過傳統的安全過濾器，構成了一種重要但尚未充分探索的風險。為填補這一空白，我們引入了欺騙性推理暴露套件（D-REX），這是一個新穎的數據集，旨在評估模型內部推理過程與其最終輸出之間的差異。D-REX是通過一場競爭性的紅隊演練構建的，參與者設計了對抗性系統提示以誘導此類欺騙行為。D-REX中的每個樣本都包含對抗性系統提示、終端用戶的測試查詢、模型看似無害的回應，以及關鍵的模型內部思維鏈，這揭示了潛在的惡意意圖。我們的基準測試促進了一項新的、至關重要的評估任務：欺騙性對齊的檢測。我們證明，D-REX對現有模型和安全機制提出了重大挑戰，突顯了迫切需要新技術來審查LLMs的內部過程，而不僅僅是其最終輸出。

English

The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.

D-REX：大型語言模型欺騙性推理檢測基準

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

摘要

Support