Faithfulness Metrics Do Not Measure Faithfulness: A Meta-Evaluation with Ground Truth

Abstract

Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they actually measure faithfulness remains unknown. Answering this question requires ground-truth labels, which are hard to obtain because internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies such as plausibility or importance, properties that are orthogonal to faithfulness and can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and by developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step level and the CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs. The best metric at the CoT level reaches only 0.70 AUROC, and the best at the step level, a different metric, reaches only 0.59. Neither transfers across settings, and both entail prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.