MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, which reduces CoT monitorability. However, a comprehensive and fully open-source benchmark for studying CoT monitorability is still lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories, to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and monitorability is negatively related to model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.
