Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models

Abstract

Do large language models reason morally, or do they merely sound as if they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs, spanning a range of architectures, parameter scales, and training regimes, to six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, effectively the inverse of human developmental norms, in which Stage 4 dominates. Most notably, a subset of models exhibits moral decoupling: a systematic inconsistency between stated moral justification and action choice. This form of logical incoherence persists across scale and prompting strategy and represents a direct failure of reasoning consistency, independent of rhetorical sophistication. Model scale has a statistically significant but practically small effect, training type has no significant independent main effect, and models exhibit near-robotic cross-dilemma consistency, producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence of moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.

