Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
May 26, 2025
作者: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
cs.AI
Abstract
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.