Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

May 26, 2025
作者: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
cs.AI

Abstract

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals the domain-specific impact of Satisfiability Modulo Theories (SMT)-based autoformalization on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC > 0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
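
The PCFG-based uncertainty signal can be illustrated concretely. The sketch below computes a simple rule-level grammar entropy from production-rule counts collected by parsing sampled formalizations; the function name, the toy SMT-LIB rule counts, and the rule-level formulation are illustrative assumptions, not necessarily the paper's exact derivational-entropy definition.

```python
import math

def grammar_entropy(rule_counts):
    """Rule-level Shannon entropy of a PCFG, summed over nonterminals.

    rule_counts: {nonterminal: {production: count}}, e.g. counts obtained
    by parsing several sampled SMT-LIB outputs against a fixed grammar.
    Low entropy means the model keeps choosing the same productions
    (consistent outputs); high entropy flags structural uncertainty.
    """
    total = 0.0
    for productions in rule_counts.values():
        n = sum(productions.values())
        for count in productions.values():
            p = count / n
            total -= p * math.log2(p)
    return total

# Hypothetical counts from 10 sampled formalizations of one problem:
counts = {
    "ASSERT": {"(assert (= x y))": 7, "(assert (>= x y))": 3},
    "DECL":   {"(declare-const x Int)": 10},
}
print(round(grammar_entropy(counts), 3))  # ~0.881 bits, all from ASSERT
```

Per the abstract, this kind of grammar-level signal was strongest on logical tasks (AUROC > 0.93), whereas token-probability entropy alone failed to identify the errors.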
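The selective-verification step can be sketched in a similarly hedged way: fuse per-sample uncertainty signals into one score and abstain above a calibrated threshold. The linear fusion, weights, and threshold below are hypothetical stand-ins; the abstract states only that a lightweight fusion of signals is used.

```python
import numpy as np

def selective_verify(signals, weights, threshold):
    """Fuse uncertainty signals linearly and decide accept vs. abstain.

    signals: (n_samples, n_signals) array, e.g. one column of token-level
    entropy and one of grammar entropy per generated formal artifact.
    Returns a boolean mask: True = pass the artifact to the verifier,
    False = abstain (re-sample or escalate to a human).
    """
    scores = signals @ weights  # lightweight linear fusion
    return scores <= threshold

# Hypothetical signal values and calibration (fit on held-out data):
signals = np.array([[0.2, 0.1],   # low uncertainty  -> accept
                    [1.8, 2.3],   # high uncertainty -> abstain
                    [0.4, 0.3]])  # low uncertainty  -> accept
weights = np.array([0.5, 0.5])
print(selective_verify(signals, weights, threshold=1.0))  # [ True False  True]
```

Raising the threshold accepts more artifacts at the cost of more undetected errors; the abstract reports that tuning this trade-off removes 14-100% of errors with minimal abstention.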