Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
March 23, 2026
Authors: Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain
cs.AI
Abstract
Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs, spanning a range of architectures, parameter scales, and training regimes, across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns.

Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, in which Stage 4 dominates. Most strikingly, a subset of models exhibits moral decoupling: systematic inconsistency between stated moral justification and action choice. This logical incoherence persists across scale and prompting strategy and represents a direct reasoning-consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency, producing logically indistinguishable responses to semantically distinct moral problems.

We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.