Grammatica van Formele Onzekerheid: Wanneer LLM's te Vertrouwen in Geautomatiseerde Redeneertaken

Samenvatting

Grote taalmodellen (LLMs) tonen opmerkelijke belofte voor het democratiseren van geautomatiseerd redeneren door het genereren van formele specificaties. Er bestaat echter een fundamentele spanning: LLMs zijn probabilistisch, terwijl formele verificatie deterministische garanties vereist. Dit artikel behandelt deze epistemologische kloof door uitgebreid onderzoek te doen naar faalmodi en kwantificering van onzekerheid (UQ) in door LLM gegenereerde formele artefacten. Onze systematische evaluatie van vijf toonaangevende LLMs onthult de domeinspecifieke impact van Satisfiability Modulo Theories (SMT) gebaseerde autoformalizatie op nauwkeurigheid (van +34,8% op logische taken tot -44,5% op feitelijke), waarbij bekende UQ-technieken zoals de entropie van tokenkansen er niet in slagen deze fouten te identificeren. We introduceren een probabilistisch contextvrij grammatica (PCFG) raamwerk om LLM-outputs te modelleren, wat resulteert in een verfijnde onzekerheidstaxonomie. We constateren dat onzekerheidssignalen taakafhankelijk zijn (bijv. grammatica-entropie voor logica, AUROC>0,93). Ten slotte maakt een lichtgewicht fusie van deze signalen selectieve verificatie mogelijk, waardoor fouten drastisch worden verminderd (14-100%) met minimale onthouding, wat LLM-gestuurde formalisatie transformeert in een betrouwbare engineeringdiscipline.

English

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

Grammatica van Formele Onzekerheid: Wanneer LLM's te Vertrouwen in Geautomatiseerde Redeneertaken

Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Samenvatting

Support