IHEval: Evaluating Language Models on Following the Instruction Hierarchy

Abstract

The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, the topic has received limited attention, and comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy are lacking. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions of different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model achieves only 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
