Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
May 26, 2025
作者: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
cs.AI
Abstract
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.