
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

October 18, 2025
Authors: Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine
cs.AI

Abstract

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative that we understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenario. MoReBench contains over 23,000 criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases in which AI advises humans on moral decisions as well as cases in which it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples that test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which may be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
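The abstract does not describe evaluation code, but the rubric-based setup it sketches (scenarios paired with criteria a reasoning trace should include or avoid) can be illustrated with a minimal, hypothetical scoring sketch. All names, the judge function, and the scoring rule below are assumptions for illustration, not the paper's actual method.

```python
from dataclasses import dataclass

# Hypothetical sketch of rubric-based scoring of a reasoning trace.
# Criterion texts, the judge, and the scoring scheme are illustrative
# assumptions, not taken from the MoReBench paper.

@dataclass
class RubricCriterion:
    description: str    # e.g., "weighs harms to both parties"
    must_include: bool  # True = should appear in the trace; False = should be avoided

def score_trace(trace: str, criteria: list[RubricCriterion], satisfies) -> float:
    """Return the fraction of criteria the reasoning trace handles correctly.

    `satisfies(trace, description)` stands in for a human or LLM judge deciding
    whether the trace meets a criterion's description.
    """
    if not criteria:
        return 0.0
    correct = 0
    for c in criteria:
        present = satisfies(trace, c.description)
        # A must-include criterion is met if present; a must-avoid one if absent.
        if present == c.must_include:
            correct += 1
    return correct / len(criteria)

# Toy usage with a trivial keyword-based judge (purely for illustration):
criteria = [
    RubricCriterion("weighs harms to both parties", must_include=True),
    RubricCriterion("asserts a single framework as objectively correct", must_include=False),
]
toy_judge = lambda trace, desc: desc.split()[0] in trace.lower()
print(score_trace("The response weighs harms to both parties before advising.", criteria, toy_judge))
```

In practice the judge would be far more involved than a keyword match; the sketch only shows how per-criterion checks could aggregate into a process-focused score.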