
DeonticBench: A Benchmark for Reasoning over Rules

April 6, 2026
Authors: Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme
cs.AI

Abstract

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, yielding formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, the best hard-subset performance reaches only 44.4% accuracy on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
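To make the solver-based workflow concrete, the sketch below mimics the statutes-to-program idea in Python rather than Prolog: a statute becomes a rule function over case facts, and deontic atoms (here, obligations) are derived by simple forward chaining. The rule, threshold, and predicate names are hypothetical illustrations, not DeonticBench's actual task format or reference programs.

```python
# Illustrative sketch only: DeonticBench itself uses Prolog programs.
# A "statute" is a rule mapping case facts to derived deontic atoms.

def derive_obligations(facts, rules):
    """Forward-chain: apply each rule until no new atoms are derived."""
    derived = set()
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for atom in rule(facts | derived):
                if atom not in derived:
                    derived.add(atom)
                    changed = True
    return derived

# Hypothetical statute: a filer whose gross income exceeds a threshold
# is obligated to file a return.
def filing_rule(facts):
    out = set()
    for f in facts:
        if f[0] == "gross_income" and f[2] > 12000:
            out.add(("obligation", f[1], "file_return"))
    return out

case_facts = {("gross_income", "alice", 30000),
              ("gross_income", "bob", 9000)}

result = derive_obligations(case_facts, [filing_rule])
print(result)  # {('obligation', 'alice', 'file_return')}
```

An actual DeonticBench program would express the same rule as a Prolog clause, so a solver's proof trace doubles as an explicit, auditable interpretation of the statute.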