

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

July 9, 2025
Authors: Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky
cs.AI

Abstract

Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning -- swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
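
To make the two-step design concrete, the following is a minimal sketch of how seed repetition and cross-tuning could be compared. Every name here (the bias_vector stub, the model and dataset labels) is hypothetical, and the bias scores are placeholder random numbers rather than outputs of the paper's actual checkpoints or evaluation suite; only the experimental structure is illustrated.

# Sketch: cross-tuning plus seed repetition, then compare bias-pattern similarity
# for pairs sharing a pretrained backbone vs. pairs sharing finetuning data.
# All identifiers below are illustrative assumptions, not the authors' code.
from itertools import product

import numpy as np
from scipy.stats import pearsonr


def bias_vector(backbone: str, instruct_data: str, seed: int) -> np.ndarray:
    """Finetune `backbone` on `instruct_data` with `seed`, then score ~30
    cognitive biases. Stubbed with random numbers for illustration."""
    rng = np.random.default_rng(hash((backbone, instruct_data, seed)) % 2**32)
    return rng.normal(size=30)  # placeholder for real per-bias scores


backbones = ["model_A", "model_B"]       # two pretrained backbones (hypothetical)
datasets = ["instruct_A", "instruct_B"]  # their (swapped) instruction datasets
seeds = [0, 1, 2]                        # repeated finetuning runs

# Cross-tuning: every backbone is finetuned on every instruction dataset.
scores = {
    (b, d, s): bias_vector(b, d, s)
    for b, d, s in product(backbones, datasets, seeds)
}


def mean_similarity(pairs):
    # Average Pearson correlation of bias-score vectors across model pairs.
    return np.mean([pearsonr(scores[x], scores[y])[0] for x, y in pairs])


# Pairs that share a pretrained backbone but differ in finetuning data...
same_backbone = [((b, "instruct_A", s), (b, "instruct_B", s))
                 for b in backbones for s in seeds]
# ...versus pairs that share finetuning data but differ in backbone.
same_data = [(("model_A", d, s), ("model_B", d, s))
             for d in datasets for s in seeds]

print("same backbone, different finetuning data:", mean_similarity(same_backbone))
print("same finetuning data, different backbone:", mean_similarity(same_data))

With real bias scores in place of the stub, the paper's finding would correspond to the first similarity exceeding the second, i.e. bias patterns tracking the pretrained backbone more than the instruction dataset.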