Many-Tier Instruction Hierarchy in LLM Agents
April 10, 2026
Authors: Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi
cs.AI
Abstract
Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying a different level of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, and comprises 853 agentic tasks (427 coding and 426 instruction-following). It composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even current frontier models perform poorly (~40% accuracy) as the number of conflicting instructions grows. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
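To make the core rule concrete (when instructions conflict, follow the highest-privilege one), here is a minimal Python sketch. It is not the paper's method or the ManyIH-Bench format. The `Instruction` class, the `resolve` function, the integer privilege scale (larger means higher privilege), and the example sources and values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    source: str     # hypothetical label for where the instruction came from
    privilege: int  # assumption: a larger number means higher privilege
    key: str        # the setting this instruction constrains
    value: str      # the value this instruction demands


def resolve(instructions):
    """For each constrained key, keep the instruction with the highest privilege.

    Ties are broken in favor of the instruction seen first (an arbitrary choice).
    """
    winners = {}
    for inst in instructions:
        current = winners.get(inst.key)
        if current is None or inst.privilege > current.privilege:
            winners[inst.key] = inst
    return winners


# Illustrative conflict: three sources disagree on "language", two on "indent".
instructions = [
    Instruction("system", 12, "language", "Python"),
    Instruction("user", 7, "language", "Rust"),
    Instruction("tool_output", 1, "language", "Go"),
    Instruction("user", 7, "indent", "4 spaces"),
    Instruction("tool_output", 1, "indent", "tabs"),
]

resolved = resolve(instructions)
for key, inst in resolved.items():
    print(f"{key}: {inst.value} (from {inst.source}, privilege {inst.privilege})")
```

Because privilege is a plain integer here rather than a fixed role label, the same function handles 3 levels or 12 without modification. A model, of course, has to perform this resolution implicitly from its context rather than through an explicit lookup, and the abstract reports that frontier models do so poorly as the number of levels grows.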