Faithfulness Metrics Do Not Measure Faithfulness: A Ground-Truth-Based Meta-Evaluation

Abstract

Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they actually measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain because internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies such as plausibility or importance, properties that are orthogonal to faithfulness and can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and by developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step level and the CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level, and a different metric reaches 0.59 AUROC at the step level; neither transfers across settings, and both entail prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
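
To make the reported AUROC figures concrete, the following is a minimal sketch of how a faithfulness metric's scores could be compared against binary ground-truth labels. It is not the benchmark's evaluation code: the function name `auroc`, the toy scores, and the label convention (1 = faithful, 0 = unfaithful) are all assumptions made for illustration.

```python
def auroc(scores, labels):
    """Pairwise AUROC: the fraction of (faithful, unfaithful) pairs in which
    the faithful item receives the higher score; ties count as half a win.

    Assumed convention (not from the paper): label 1 = faithful, 0 = unfaithful.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one item of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six CoTs with ground-truth labels and the scores
# assigned to them by some faithfulness metric.
labels = [1, 1, 1, 0, 0, 0]
metric_scores = [0.9, 0.4, 0.6, 0.5, 0.2, 0.7]

print(f"AUROC = {auroc(metric_scores, labels):.2f}")
```

Under this reading, 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. On the toy data, 6 of the 9 faithful/unfaithful pairs are ordered correctly, giving about 0.67.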