ChatPaper.ai


Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models

March 23, 2026
Authors: Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain
cs.AI

Abstract

Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibits moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning-consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency, producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
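The abstract describes an LLM-as-judge pipeline that assigns each model response a Kohlberg stage, validated across three judge models. The sketch below is a hypothetical illustration of how such a pipeline might aggregate judgments: the paper does not specify its aggregation rule, so the majority vote, the stub judge functions, and the stage values are all assumptions for illustration only. In the actual study, each judge would be an LLM called with a stage-classification prompt.

```python
from collections import Counter

# Hypothetical stand-ins for the paper's three judge models. Each maps a
# model response to a Kohlberg stage (1-6); real judges would be LLM calls.
def judge_a(response: str) -> int:
    return 5  # placeholder: post-conventional (Stage 5)

def judge_b(response: str) -> int:
    return 5  # placeholder

def judge_c(response: str) -> int:
    return 6  # placeholder: Stage 6

def classify_stage(response: str) -> int:
    """Assign a Kohlberg stage by majority vote across the three judges.

    Aggregation by majority vote is an assumption; the paper only states
    that the pipeline was validated across three judge models.
    """
    votes = [judge(response) for judge in (judge_a, judge_b, judge_c)]
    stage, _count = Counter(votes).most_common(1)[0]
    return stage

print(classify_stage("Stealing the drug is justified because..."))  # → 5
```

With these placeholder judges, the two Stage-5 votes outvote the single Stage-6 vote, so the response is classified as Stage 5.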