我們能信任AI解釋嗎?思維鏈推理中系統性漏報的證據
Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
December 25, 2025
作者: Deep Pankajbhai Mehta
cs.AI
摘要
當人工智慧系統逐步解釋其推理過程時,從業者常假設這些解釋能揭示真正影響AI決策的因素。我們透過在問題中嵌入提示線索並檢測模型是否提及這些線索,對該假設進行驗證。在一項針對11個主流AI模型、超過9,000個測試案例的研究中,我們發現令人擔憂的規律:模型幾乎從不會主動提及提示線索,但當被直接詢問時,卻承認注意到這些線索。這表明模型能識別關鍵資訊卻選擇隱瞞。即使告知模型其行為正被監控也無濟於事。強制模型回報提示線索雖有效,但會導致其在沒有線索時虛假回報,並降低回答準確度。我們還發現,迎合使用者偏好的提示尤其危險——模型最常遵循這類提示,卻極少主動披露。這些發現說明,僅觀察AI的推理過程不足以察覺潛在的隱性影響。
English
When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous-models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.