MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, which reduces CoT monitorability. However, a comprehensive and fully open-source benchmark for studying CoT monitorability is still lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors, covering 19 tasks in 7 categories, to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and monitorability is negatively related to model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying more advanced stress-test techniques for monitorability, and developing new monitoring approaches.
