Faithfulness Metrics Do Not Measure Faithfulness: A Meta-Evaluation with Ground Truth

Abstract

Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they actually measure faithfulness remains unknown. Answering this question requires ground-truth labels, which are hard to obtain because internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies such as plausibility or importance. These properties are orthogonal to faithfulness and can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and by developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step level and the CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level, and a different metric reaches 0.59 at the step level; neither transfers across settings, and both entail prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
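To make the headline numbers concrete, the sketch below shows one way a faithfulness metric could be scored against binary ground-truth labels using AUROC. It is a minimal illustration, not the BonaFide evaluation code: the function name `auroc`, the label convention (1 = faithful, 0 = unfaithful), and the toy scores are all assumptions introduced here. AUROC is computed as the probability that a randomly chosen faithful item receives a higher metric score than a randomly chosen unfaithful one, with ties counted as one half, so 0.5 corresponds to chance.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via pairwise comparison of positive and negative scores.

    Assumed convention (not from the paper): label 1 = faithful, 0 = unfaithful,
    and a higher metric score is meant to indicate a more faithful CoT or step.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: ground-truth labels and one metric's scores.
labels = [1, 1, 1, 0, 0, 0]
metric_scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.4]
print(f"AUROC = {auroc(metric_scores, labels):.3f}")
```

The pairwise form is quadratic in the number of items, which is acceptable for a few thousand labeled CoTs; a rank-based implementation from a standard library would be the usual choice at larger scale.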