
DeonticBench: A Benchmark for Reasoning over Rules

April 6, 2026
Authors: Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme
cs.AI

Abstract
Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% accuracy on SARA Numeric and a 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
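To give a flavor of the solver-based workflow the abstract describes (statutes and case facts rendered as an executable program whose run yields an explicit verdict), here is a minimal sketch in Python rather than Prolog. The rule, the weight limit, the prohibited-item list, and all names (`Bag`, `carry_on_verdict`) are invented for illustration and are not drawn from the benchmark itself.

```python
# Toy analogue of the solver-based workflow: a hand-encoded baggage-policy
# rule plus case facts, evaluated to a deontic verdict. All rule content
# here is hypothetical, not taken from DEONTICBENCH.

from dataclasses import dataclass

CABIN_WEIGHT_LIMIT_KG = 10                 # assumed policy parameter
PROHIBITED_ITEMS = {"lithium_battery_pack"}  # assumed prohibited-item list

@dataclass
class Bag:
    item: str
    weight_kg: float

def carry_on_verdict(bag: Bag) -> str:
    """Derive a deontic status for bringing the bag into the cabin.

    Invented rule: prohibited items may never be carried on; bags within
    the weight limit are permitted; heavier bags must be checked in.
    """
    if bag.item in PROHIBITED_ITEMS:
        return "prohibited"
    if bag.weight_kg <= CABIN_WEIGHT_LIMIT_KG:
        return "permitted"
    return "check_obligatory"

if __name__ == "__main__":
    print(carry_on_verdict(Bag("laptop", 2.0)))        # permitted
    print(carry_on_verdict(Bag("suitcase", 18.0)))     # check_obligatory
```

In the benchmark's actual setting, the model would instead emit Prolog clauses for the statute and facts, and the Prolog solver's execution trace would serve as the explicit, inspectable derivation of the verdict.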