Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

May 26, 2025
作者: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
cs.AI

Abstract

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals the domain-specific impact of Satisfiability Modulo Theories (SMT)-based autoformalization on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC > 0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
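
The PCFG-based uncertainty signal can be illustrated concretely. The sketch below computes a simple rule-level grammar entropy from production-rule counts collected by parsing sampled formalizations; the function name, the toy SMT-LIB rule counts, and the rule-level formulation are illustrative assumptions, not necessarily the paper's exact derivational-entropy definition.

```python
import math

def grammar_entropy(rule_counts):
    """Rule-level Shannon entropy of a PCFG, summed over nonterminals.

    rule_counts: {nonterminal: {production: count}}, e.g. counts obtained
    by parsing several sampled SMT-LIB outputs against a fixed grammar.
    Low entropy means the model keeps choosing the same productions
    (consistent outputs); high entropy flags structural uncertainty.
    """
    total = 0.0
    for productions in rule_counts.values():
        n = sum(productions.values())
        for count in productions.values():
            p = count / n
            total -= p * math.log2(p)
    return total

# Hypothetical counts from 10 sampled formalizations of one problem:
counts = {
    "ASSERT": {"(assert (= x y))": 7, "(assert (>= x y))": 3},
    "DECL":   {"(declare-const x Int)": 10},
}
print(round(grammar_entropy(counts), 3))  # ~0.881 bits, all from ASSERT
```

Per the abstract, this kind of grammar-level signal was strongest on logical tasks (AUROC > 0.93), whereas token-probability entropy alone failed to identify the errors.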
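The selective-verification step can be sketched in a similarly hedged way: fuse per-sample uncertainty signals into one score and abstain above a calibrated threshold. The linear fusion, weights, and threshold below are hypothetical stand-ins; the abstract states only that a lightweight fusion of signals is used.

```python
import numpy as np

def selective_verify(signals, weights, threshold):
    """Fuse uncertainty signals linearly and decide accept vs. abstain.

    signals: (n_samples, n_signals) array, e.g. one column of token-level
    entropy and one of grammar entropy per generated formal artifact.
    Returns a boolean mask: True = pass the artifact to the verifier,
    False = abstain (re-sample or escalate to a human).
    """
    scores = signals @ weights  # lightweight linear fusion
    return scores <= threshold

# Hypothetical signal values and calibration (fit on held-out data):
signals = np.array([[0.2, 0.1],   # low uncertainty  -> accept
                    [1.8, 2.3],   # high uncertainty -> abstain
                    [0.4, 0.3]])  # low uncertainty  -> accept
weights = np.array([0.5, 0.5])
print(selective_verify(signals, weights, threshold=1.0))  # [ True False  True]
```

Raising the threshold accepts more artifacts at the cost of more undetected errors; the abstract reports that tuning this trade-off removes 14-100% of errors with minimal abstention.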