
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

July 9, 2025
作者: Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky
cs.AI

Abstract

Large language models (LLMs) exhibit cognitive biases: systematic tendencies toward irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear whether these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning, swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
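The two-step design lends itself to a compact illustration. The Python sketch below is not the authors' code: `finetune`, `measure_biases`, the backbone and dataset names, and the correlation-based comparison are hypothetical placeholders, shown only to make concrete how seed-repeated finetuning (step 1), cross-tuning (step 2), and the backbone-versus-dataset comparison fit together.

```python
# A minimal sketch of the two-step design described above; not the authors'
# code. finetune() and measure_biases() are hypothetical placeholders for the
# paper's actual instruction-tuning and bias-evaluation pipeline.
from itertools import product
import numpy as np

BACKBONES = ["backbone_A", "backbone_B"]                 # two pretrained models
INSTRUCTION_SETS = ["instructions_A", "instructions_B"]  # their instruction data
NATIVE_DATA = dict(zip(BACKBONES, INSTRUCTION_SETS))
SEEDS = [0, 1, 2]                                        # repeated finetuning runs
N_BIASES = 30                                            # one score per bias test

def finetune(backbone, dataset, seed):
    """Placeholder for instruction tuning; returns an identifier for the run."""
    return (backbone, dataset, seed)

def measure_biases(model):
    """Placeholder bias evaluation; returns a vector of per-bias scores."""
    rng = np.random.default_rng(abs(hash(model)) % (2**32))
    return rng.normal(size=N_BIASES)

# Step 1: finetune each backbone on its own instruction data with several
# seeds, to estimate how much training randomness alone moves bias scores.
seed_runs = {
    (b, s): measure_biases(finetune(b, NATIVE_DATA[b], s))
    for b, s in product(BACKBONES, SEEDS)
}
seed_spread = np.std([seed_runs[("backbone_A", s)] for s in SEEDS], axis=0).mean()

# Step 2: cross-tuning. Swap instruction datasets across backbones so a bias
# pattern can be attributed to either the pretrained backbone or the data.
cross_runs = {
    (b, d): measure_biases(finetune(b, d, seed=0))
    for b, d in product(BACKBONES, INSTRUCTION_SETS)
}

def similarity(v1, v2):
    """Pearson correlation between two bias-score vectors."""
    return np.corrcoef(v1, v2)[0, 1]

# If biases originate mostly in pretraining, runs sharing a backbone should
# correlate more strongly than runs sharing only the finetuning dataset.
same_backbone = similarity(cross_runs[("backbone_A", "instructions_A")],
                           cross_runs[("backbone_A", "instructions_B")])
same_dataset = similarity(cross_runs[("backbone_A", "instructions_A")],
                          cross_runs[("backbone_B", "instructions_A")])
print(f"seed spread: {seed_spread:.2f}  "
      f"same backbone: {same_backbone:.2f}  same data: {same_dataset:.2f}")
```

Under this setup, if biases are indeed "planted in pretraining", the same-backbone similarity should exceed the same-dataset similarity by more than the seed-to-seed spread alone would explain, which is the comparison the abstract reports.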