Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
July 10, 2025
Authors: Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, Jaime Fernández Fisac
cs.AI
Abstract
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to
statements made without regard to their truth value. While previous work has
explored large language model (LLM) hallucination and sycophancy, we propose
machine bullshit as an overarching conceptual framework that can allow
researchers to characterize the broader phenomenon of emergent loss of
truthfulness in LLMs and shed light on its underlying mechanisms. We introduce
the Bullshit Index, a novel metric quantifying LLMs' indifference to truth, and
propose a complementary taxonomy analyzing four qualitative forms of bullshit:
empty rhetoric, paltering, weasel words, and unverified claims. We conduct
empirical evaluations on the Marketplace dataset, the Political Neutrality
dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI
assistants) explicitly designed to evaluate machine bullshit. Our results
demonstrate that model fine-tuning with reinforcement learning from human
feedback (RLHF) significantly exacerbates bullshit, while inference-time
chain-of-thought (CoT) prompting notably amplifies specific bullshit forms,
particularly empty rhetoric and paltering. We also observe prevalent machine
bullshit in political contexts, with weasel words as the dominant strategy. Our
findings highlight systematic challenges in AI alignment and provide new
insights toward more truthful LLM behavior.
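
The abstract describes the Bullshit Index as a metric quantifying indifference to truth but does not give its formula. As a minimal illustrative sketch only (not the authors' definition), one way to operationalize such indifference is the lack of correlation between a model's internal belief that a statement is true and the claim it actually makes; the function name and data below are hypothetical.

```python
# Illustrative sketch: score indifference to truth as 1 minus the absolute
# point-biserial correlation between a model's internal belief probability
# b_i (that a statement is true) and its explicit binary claim y_i
# (1 = asserts true, 0 = asserts false). This is an assumption for
# illustration, not the paper's exact Bullshit Index definition.
import numpy as np
from scipy.stats import pointbiserialr

def bullshit_index_sketch(beliefs, claims):
    """beliefs: P(statement is true) in [0, 1]; claims: binary assertions made."""
    beliefs = np.asarray(beliefs, dtype=float)
    claims = np.asarray(claims, dtype=int)
    r, _ = pointbiserialr(claims, beliefs)  # how strongly claims track beliefs
    return 1.0 - abs(r)  # near 1 = claims unrelated to beliefs (indifference to truth)

# Claims that track beliefs give a low score; claims made regardless of belief
# drive the score toward 1.
print(bullshit_index_sketch([0.9, 0.8, 0.1, 0.2, 0.95], [1, 1, 0, 0, 1]))
```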