Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation
September 10, 2025
Authors: Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, Dirk Hovy
cs.AI
Abstract
Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking.
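As a hedged illustration of how such variation propagates (a minimal sketch, not the paper's actual pipeline), the following snippet simulates a downstream group comparison in which each hypothetical annotation configuration introduces a different label-error rate; the configuration names, error rates, synthetic data, and logistic-regression test are all illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic corpus: a binary covariate of interest and a latent true label
# whose prevalence differs moderately between the two groups.
n = 2000
group = rng.integers(0, 2, n)
true_label = rng.binomial(1, 0.30 + 0.08 * group)  # true effect exists

# Hypothetical annotation configurations mapped to label-error rates.
# In practice these would be model, prompt, and temperature choices.
configs = {"model_A / prompt 1": 0.05,
           "model_A / prompt 2": 0.12,
           "model_B / prompt 1": 0.25}

for name, err in configs.items():
    flip = rng.binomial(1, err, n)  # configuration-dependent label noise
    llm_label = np.where(flip == 1, 1 - true_label, true_label)
    # Identical downstream analysis for every configuration:
    # logistic regression of the LLM label on the group indicator.
    fit = sm.Logit(llm_label, sm.add_constant(group)).fit(disp=0)
    coef, p = fit.params[1], fit.pvalues[1]
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: coef = {coef:+.3f}, p = {p:.3f} ({verdict})")
```

Because every configuration feeds the same regression, noisier annotation attenuates the estimated coefficient and can flip the conclusion at the 0.05 threshold, even though the underlying texts and hypothesis never change.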
We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors.
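One way to make the error taxonomy above operational (a sketch under assumed thresholds, not the paper's exact criteria) is to compare each LLM-based estimate against the estimate obtained from human annotations and label the discrepancy; the `alpha` level and the `m_factor` magnitude cutoff below are illustrative assumptions.

```python
def classify_llm_hacking_error(beta_llm, p_llm, beta_human, p_human,
                               alpha=0.05, m_factor=2.0):
    """Compare an estimate from LLM-annotated data with the estimate from
    human-annotated data and label the discrepancy, if any.

    The alpha level and magnitude-inflation factor are illustrative
    assumptions, not the criteria used in the paper.
    """
    sig_llm, sig_human = p_llm < alpha, p_human < alpha
    if sig_llm and not sig_human:
        return "Type I (false positive)"
    if not sig_llm and sig_human:
        return "Type II (false negative)"
    if sig_llm and sig_human:
        if beta_llm * beta_human < 0:
            return "Type S (wrong sign)"
        if abs(beta_llm) > m_factor * abs(beta_human):
            return "Type M (exaggerated magnitude)"
    return "no error"

# Example: the LLM-based regression finds a significant effect
# where the human-annotation benchmark does not.
print(classify_llm_hacking_error(beta_llm=0.42, p_llm=0.01,
                                 beta_human=0.05, p_human=0.40))
```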
Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With just a few LLMs and a handful of prompt paraphrases, anything can be presented as statistically significant.
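To make concrete how little such intentional LLM hacking requires, here is a hedged sketch of a significance-shopping loop over hypothetical (model, prompt paraphrase) pairs; the `annotate` function is a simulated stand-in for an LLM call, and the data are generated under a true null effect, so any "significant" result the loop reports is a false positive produced purely by searching across configurations.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Null setting: the outcome truly does not differ between the two groups.
n = 1500
group = rng.integers(0, 2, n)
true_label = rng.binomial(1, 0.3, n)

def annotate(model, prompt, labels):
    """Simulated stand-in for an LLM annotation call: returns the true
    labels with noise whose level varies arbitrarily per configuration."""
    err = 0.05 + 0.10 * rng.random()
    flip = rng.binomial(1, err, labels.size)
    return np.where(flip == 1, 1 - labels, labels)

models = ["llm_small", "llm_medium", "llm_large"]   # hypothetical model names
prompts = [f"paraphrase_{i}" for i in range(5)]     # hypothetical prompt variants

for model, prompt in itertools.product(models, prompts):
    llm_label = annotate(model, prompt, true_label)
    # Same downstream test each time: does the labelled outcome differ by group?
    table = [[np.sum((group == g) & (llm_label == y)) for y in (0, 1)]
             for g in (0, 1)]
    _, p, _, _ = stats.chi2_contingency(table)
    if p < 0.05:  # selectively report the first configuration that "works"
        print(f"'Significant' with {model} + {prompt}: p = {p:.3f}")
        break
else:
    print("No configuration crossed p < 0.05 in this run.")
```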