D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Abstract

The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.
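The four per-sample components listed in the abstract can be pictured as a small record type. The following is only a minimal sketch: the class name `DRexSample`, its field names, and the helper `is_deceptive_candidate` are hypothetical illustrations, not the dataset's actual schema or the authors' evaluation code. The two boolean flags are assumed to come from some external judge of the reasoning and of the final output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRexSample:
    """One benchmark record. Field names are hypothetical, not the real schema."""
    system_prompt: str     # adversarial system prompt crafted by a red-teamer
    user_query: str        # end-user's test query
    chain_of_thought: str  # model's internal reasoning
    response: str          # seemingly innocuous final output


def is_deceptive_candidate(reasoning_flagged: bool, output_flagged: bool) -> bool:
    """Return True for the case the abstract targets: the reasoning is judged
    malicious while the final output passes as benign.
    Both flags are assumed inputs from an external judge."""
    return reasoning_flagged and not output_flagged


# Placeholder values only; no real dataset content is reproduced here.
sample = DRexSample(
    system_prompt="<adversarial system prompt>",
    user_query="<test query>",
    chain_of_thought="<internal reasoning>",
    response="<final response>",
)
```

The point of the sketch is that an output-only filter sees just `response`, whereas the detection task described above also needs `chain_of_thought`.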
