MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
March 30, 2026
Authors: Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang
cs.AI
Abstract
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to reduced CoT monitorability. However, a comprehensive and fully open-source benchmark for studying CoT monitorability is still lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors, spanning 19 tasks in 7 categories, that characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings that quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs of varying capability show that CoT monitorability is higher when producing the final target response requires structured reasoning over the decision-critical factor. Closed-source LLMs generally show lower monitorability, and monitorability correlates negatively with model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress tests, with monitorability dropping by up to 30% on some tasks that do not require structured reasoning over the decision-critical factors. Beyond these empirical findings, MonitorBench provides a basis for future work on evaluating new LLMs, studying advanced techniques for stress-testing monitorability, and developing new monitoring approaches.
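As a rough illustration only (this scoring scheme is not taken from the paper, and every field and function name below is hypothetical), the headline quantity can be read as the fraction of test instances whose CoT surfaces the decision-critical factor that actually drove the final answer:

```python
# Hypothetical sketch of a per-task CoT monitorability score; the data
# layout and the matching rule are illustrative assumptions, not the
# paper's method.
from dataclasses import dataclass


@dataclass
class Instance:
    prompt: str
    decision_critical_factor: str  # assumed field: the factor planted in the prompt
    cot: str                       # model's chain of thought
    final_answer: str


def surfaces_factor(inst: Instance) -> bool:
    """Toy check: does the CoT verbalize the decision-critical factor?

    A real monitor would need a semantic judge (e.g., an LLM grader)
    rather than substring matching; this stands in for that judgment.
    """
    return inst.decision_critical_factor.lower() in inst.cot.lower()


def monitorability(instances: list[Instance]) -> float:
    """Fraction of instances whose CoT surfaces the driving factor."""
    if not instances:
        return 0.0
    return sum(surfaces_factor(i) for i in instances) / len(instances)
```

Under this reading, a drop under a stress test would mean the model omits the factor from its CoT on more instances while still producing the factor-dependent answer.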