

Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

September 10, 2025
Authors: Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, Dirk Hovy
cs.AI

Abstract

Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant.
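
The abstract's core mechanism, that defensible implementation choices alone can flip a downstream statistical conclusion, can be illustrated with a small simulation. The sketch below is not the authors' replication pipeline; the sample size, group prevalences, configuration names, and per-configuration error rates are all illustrative assumptions. It applies the same two-proportion test to labels produced by several hypothetical annotator configurations and shows how the significance verdict can change with the configuration while the underlying documents stay fixed.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): how plausible
# "researcher configurations" (model choice, prompt, temperature) can flip a
# statistical conclusion drawn from LLM-annotated data.
import math
import random

random.seed(0)

N = 2000                               # documents per group (assumption)
TRUE_RATE = {"A": 0.30, "B": 0.33}     # true positive-class rates per group (assumption)

# Hypothetical annotator configurations, each with its own
# (false-positive rate, false-negative rate) error profile (assumptions).
CONFIGS = {
    "large_model_prompt_1": (0.02, 0.02),
    "large_model_prompt_2": (0.08, 0.01),
    "large_model_temp_1.0": (0.01, 0.10),
    "small_model_prompt_1": (0.12, 0.15),
}

def two_prop_p_value(k1, k2, n1, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# Ground-truth labels for the two groups.
truth = {g: [random.random() < TRUE_RATE[g] for _ in range(N)] for g in TRUE_RATE}
p = two_prop_p_value(sum(truth["A"]), sum(truth["B"]), N, N)
print(f"ground truth        : p = {p:.3f}")

# Each configuration re-annotates the same documents with its own error profile;
# the downstream hypothesis test is identical, only the annotator changes.
for name, (fpr, fnr) in CONFIGS.items():
    labels = {
        g: [(random.random() >= fnr) if y else (random.random() < fpr) for y in truth[g]]
        for g in truth
    }
    p = two_prop_p_value(sum(labels["A"]), sum(labels["B"]), N, N)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name:20s}: p = {p:.3f} ({verdict})")
```

Re-running the sketch with different seeds shows, in miniature, the pattern the paper quantifies at scale: noisier configurations are more likely to turn a borderline true effect into a Type II error or to manufacture significance where the effect is weak, which is why findings near significance thresholds warrant the extra verification the authors call for.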