Many-Tier Instruction Hierarchy in LLM Agents

Abstract

Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying a different level of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges and comprises 853 agentic tasks (427 coding and 426 instruction-following). It composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
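As a rough illustration of the rule stated above (when instructions conflict, follow the one with the highest privilege), the following minimal sketch resolves conflicts among instructions tagged with integer privilege levels. It is not the ManyIH-Bench data format or evaluation code: the `Instruction` class, its field names, the `resolve` function, the example sources and values, and the convention that a larger integer means higher privilege are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    source: str     # hypothetical label for where the instruction came from
    privilege: int  # assumption: a larger integer means higher privilege
    key: str        # the setting this instruction constrains
    value: str      # the value this instruction demands


def resolve(instructions):
    """For each constrained key, keep the value from the highest-privilege instruction.

    On a privilege tie, the instruction seen first wins (an arbitrary choice
    made for this sketch).
    """
    winners = {}
    for ins in instructions:
        current = winners.get(ins.key)
        if current is None or ins.privilege > current.privilege:
            winners[ins.key] = ins
    return {key: ins.value for key, ins in winners.items()}


# Hypothetical conflicting instructions spread over several privilege levels.
instructions = [
    Instruction("system", 12, "language", "Python"),
    Instruction("user", 7, "language", "Rust"),
    Instruction("tool_output", 2, "language", "Go"),
    Instruction("user", 7, "indent", "4 spaces"),
    Instruction("tool_output", 2, "indent", "tabs"),
]

resolved = resolve(instructions)
print(resolved)  # {'language': 'Python', 'indent': '4 spaces'}
```

Because privileges here are plain integers rather than a fixed set of role labels, the same function works unchanged for any number of levels. A model, of course, has to perform this resolution implicitly over natural-language instructions.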
