MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

October 18, 2025
Authors: Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine
cs.AI

Abstract

As AI systems progress, we rely on them more to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative that we understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To that end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenario. MoReBench contains over 23,000 criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases where AI advises humans on moral decisions as well as cases where it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples that test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
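To make the rubric-based setup concrete, the sketch below shows one plausible way to represent and score a MoReBench-style scenario in Python. Everything here is an illustrative assumption based on the abstract, not the authors' released code or data: the `Criterion` and `Scenario` classes, the fraction-of-criteria-satisfied aggregation rule, and the example rubric entries are all hypothetical, and in practice whether a criterion appears in a reasoning trace would likely be judged by an LLM or a human rater.

```python
# Minimal sketch of rubric-based scoring as described in the abstract.
# All names and the aggregation rule are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    text: str           # what the reasoning trace should (not) contain
    must_include: bool  # True: should appear; False: should be avoided


@dataclass
class Scenario:
    dilemma: str
    rubric: List[Criterion]


def rubric_score(rubric: List[Criterion], present: List[bool]) -> float:
    """Fraction of rubric criteria satisfied by a reasoning trace.

    `present[i]` records whether criterion i was detected in the trace
    (e.g., by an LLM judge). A must-include criterion is satisfied when
    present; a must-avoid criterion is satisfied when absent.
    """
    hits = sum(
        (found if c.must_include else not found)
        for c, found in zip(rubric, present)
    )
    return hits / len(rubric)


# Hypothetical example: two criteria to include, one to avoid.
scenario = Scenario(
    dilemma="A friend asks you to lie to spare them minor embarrassment.",
    rubric=[
        Criterion("Identifies honesty vs. loyalty as competing considerations", True),
        Criterion("Gives an actionable recommendation", True),
        Criterion("Asserts a single objectively correct answer", False),
    ],
)
print(rubric_score(scenario.rubric, [True, True, False]))  # -> 1.0
```

Scoring the fraction of satisfied criteria, rather than grading the final verdict, matches the paper's framing: multiple conclusions can be defensible, so the evaluation targets the reasoning process instead of the outcome.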