Probing Outcome-Level Resemblance and Mechanism-Level Consistency in LLM Risk Decisions: Evidence from the St. Petersburg Game

Abstract

LLMs can appear cautious in risky decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed. The game is a classical paradox: its expected payoff is infinite, yet humans typically report a low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that, rather than maintaining the human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risky decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
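To make the role of truncation concrete, the following minimal sketch computes the expected payoff of a truncated St. Petersburg game. It assumes the common textbook convention, which may differ from the prompts used in this work: a fair coin is flipped until the first heads, a first heads on flip k pays 2^k, and a game cut off after `max_flips` flips without heads pays nothing. The function name `truncated_expected_value` and the parameter `max_flips` are illustrative, not taken from the paper.

```python
from fractions import Fraction


def truncated_expected_value(max_flips: int) -> Fraction:
    """Expected payoff when the game stops after at most max_flips flips.

    Assumed convention (not specified in the source): the first heads on
    flip k has probability 1/2**k and pays 2**k; no heads within
    max_flips flips pays 0.
    """
    total = Fraction(0)
    for k in range(1, max_flips + 1):
        probability = Fraction(1, 2**k)  # tails k-1 times, then heads
        payoff = 2**k
        total += probability * payoff  # each round contributes exactly 1
    return total


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, truncated_expected_value(n))
```

Under this convention every possible round contributes exactly one unit, so the truncated expected value equals the truncation depth and grows without bound as the cutoff is lifted. A purely computational bidder would therefore be expected to track the cutoff closely, whereas a bid that stays low and finite regardless of the cutoff would not follow from the expected-value calculation alone.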