Faithfulness Metrics Fail to Measure Faithfulness: A Ground-Truth-Based Meta-Evaluation

Abstract

Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they actually measure faithfulness remains unknown. Answering this question requires ground-truth labels, which are hard to obtain because internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies such as plausibility or importance, properties that are orthogonal to faithfulness and can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and by developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step level and the CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level, and another reaches only 0.59 at the step level; neither transfers across settings, and both entail prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
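To make the headline numbers concrete, the following is a minimal sketch of how a faithfulness metric could be scored against ground-truth labels using AUROC. It is an illustration under stated assumptions, not the benchmark's actual evaluation code: the function name `auroc`, the variable names, and the toy scores and labels are all hypothetical, and labels are assumed to be binary (1 = faithful, 0 = unfaithful). An AUROC of 0.5 corresponds to chance-level ranking, which is the reference point for the "near chance" finding above.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: the probability that a randomly chosen faithful item
    receives a higher metric score than a randomly chosen unfaithful item.
    Ties between a faithful and an unfaithful item count as half a win."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, y in zip(scores, labels) if y == 1]  # faithful items
    neg = [s for s, y in zip(scores, labels) if y == 0]  # unfaithful items
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one metric score per CoT and its ground-truth label.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.5]
gold_labels = [1, 0, 1, 0, 0, 1]

print(f"CoT-level AUROC: {auroc(metric_scores, gold_labels):.3f}")
```

The quadratic pairwise form is chosen here for readability; for thousands of labeled CoTs, a rank-based implementation or a library routine would be the more practical choice. The same function applies at the step level if each score and label refers to a single reasoning step instead of a whole CoT.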