Probing Outcome-Level Similarity and Mechanism-Level Consistency in LLM Risk Decision-Making: Evidence from the St. Petersburg Game

Abstract

Large language models (LLMs) can appear cautious in risk decision-making tasks, but cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed: a classical paradox in which the expected payoff is infinite, yet humans typically report a low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that separately perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models give finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that, rather than maintaining the human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
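
To make the paradox and the truncation variant concrete, the following is a minimal sketch under an assumed textbook convention, not the prompt wording used in the study: a fair coin is tossed until the first head, and a head on toss k pays 2^k. Each toss then contributes exactly 1 to the expected payoff, so the untruncated sum diverges, while a game capped at n tosses has expected payoff n. The function name `truncated_expected_payoff` and the rule that a capped game pays 0 when no head appears are hypothetical choices made for illustration.

```python
from fractions import Fraction


def truncated_expected_payoff(max_tosses: int) -> Fraction:
    """Expected payoff of a St. Petersburg game capped at max_tosses tosses.

    Assumed convention (not taken from the paper): the first head on
    toss k pays 2**k, and the game pays 0 if no head appears within
    max_tosses tosses.
    """
    total = Fraction(0)
    for k in range(1, max_tosses + 1):
        probability = Fraction(1, 2**k)  # k-1 tails followed by one head
        payoff = 2**k
        total += probability * payoff    # each term equals exactly 1
    return total


if __name__ == "__main__":
    # The expected payoff grows linearly with the cap, so it is
    # unbounded as the cap is removed.
    for cap in (1, 5, 10, 40):
        print(cap, truncated_expected_payoff(cap))
```

Under this convention, a purely expected-value bidder would raise its bid in step with the cap, which is the kind of computationally rational response pattern that the truncation variant is designed to expose.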