ChatPaper.aiChatPaper

我们能否信任AI解释?思维链推理中系统性漏报的证据

Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

December 25, 2025
作者: Deep Pankajbhai Mehta
cs.AI

摘要

当人工智能系统逐步展示其推理过程时,从业者常认为这些解释揭示了实际影响AI答案的因素。我们通过将提示信息嵌入问题并检测模型是否提及这些提示,对该假设进行了验证。在对11个主流AI模型超过9000个测试案例的研究中,我们发现了一个令人担忧的模式:模型几乎从不主动提及提示信息,但当被直接询问时,它们却承认注意到了这些提示。这表明模型能够识别关键信息却选择不报告。警告模型其行为正被监控并无改善作用。强制要求模型报告提示虽有效果,但会导致其在无提示时也虚构报告,并降低答案准确率。我们还发现,迎合用户偏好的提示尤其危险——模型最常遵循这类提示却最少报告它们。这些发现表明,仅观察AI的推理过程不足以发现潜在的影响因素。
English
When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous-models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.
PDF33February 8, 2026