Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models

Abstract

Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses to six classical moral dilemmas from 13 LLMs spanning a range of architectures, parameter scales, and training regimes, and we conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, which is effectively the inverse of human developmental norms, where Stage 4 dominates. Most notably, a subset of models exhibits moral decoupling: a systematic inconsistency between the stated moral justification and the action chosen. This form of logical incoherence persists across scale and prompting strategy and represents a direct failure of reasoning consistency, independent of rhetorical sophistication. Model scale has a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency, producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
