

Once Upon an Input: Reasoning via Per-Instance Program Synthesis

October 26, 2025
Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
cs.AI

Abstract

Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps, such as Chain of Thought (CoT) and Program of Thought (PoT), improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback, without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks, including all tasks of Big Bench Extra Hard (BBEH) as well as visual question answering, relational reasoning, and mathematical reasoning tasks, show that PIPS improves absolute harmonic mean accuracy by up to 8.6% and 9.4% over PoT and CoT, respectively, and reduces undesirable program generations on the algorithmic tasks by 65.1% compared to PoT with Gemini-2.0-Flash.
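To make the routing and refinement loop concrete, below is a minimal Python sketch of the control flow the abstract describes. Every helper name (llm_direct_answer, llm_synthesize_program, structural_check, llm_refine_program), the confidence threshold, and the stub bodies are illustrative assumptions, not the paper's actual prompts, confidence metric, or API.

```python
# Hypothetical sketch of the PIPS control flow described in the abstract:
# per instance, either answer directly or synthesize-and-refine a program,
# routed by a confidence score. All helpers are stand-ins for LLM calls.

from dataclasses import dataclass


@dataclass
class Attempt:
    answer: str
    confidence: float  # assumed to lie in [0, 1]


def llm_direct_answer(instance: str) -> Attempt:
    # Stub for direct inference (e.g., CoT). A real implementation would
    # query the LLM and derive a confidence score for its answer.
    return Attempt(answer="<direct answer>", confidence=0.3)


def llm_synthesize_program(instance: str) -> str:
    # Stub: ask the LLM to write a program tailored to this one instance.
    return "def solve():\n    return '<program answer>'"


def structural_check(program: str) -> list[str]:
    # Stub for structural feedback: e.g., flag syntax errors or programs
    # that hard-code an output instead of computing it. Note that no
    # explicit test cases are involved.
    try:
        compile(program, "<synthesized>", "exec")
    except SyntaxError as err:
        return [f"syntax error: {err}"]
    return []


def llm_refine_program(program: str, feedback: list[str]) -> str:
    # Stub: ask the LLM to repair the program given structural feedback.
    return program


def run_program(program: str) -> str:
    # Stub executor: run the synthesized program and call its entry point.
    scope: dict = {}
    exec(program, scope)
    return scope["solve"]()


def pips(instance: str, threshold: float = 0.8, max_rounds: int = 3) -> str:
    direct = llm_direct_answer(instance)
    # Per-instance routing: keep the direct answer when confidence is high
    # enough; otherwise fall back to program synthesis for this instance.
    if direct.confidence >= threshold:
        return direct.answer
    program = llm_synthesize_program(instance)
    for _ in range(max_rounds):
        feedback = structural_check(program)
        if not feedback:  # structurally sound: stop refining
            break
        program = llm_refine_program(program, feedback)
    return run_program(program)


print(pips("How many prime numbers are less than 20?"))
```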
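For reference on the reported metric: a harmonic mean accuracy over the n = 30 benchmarks with per-benchmark accuracies a_1, ..., a_n would be computed as

$\bar{a}_{\mathrm{HM}} = \dfrac{n}{\sum_{i=1}^{n} 1/a_i}$

so an "absolute" improvement of 8.6% reads as this aggregate rising by 8.6 percentage points. The exact aggregation is not spelled out in the abstract, but the harmonic mean is the standard choice when a method should be penalized for failing badly on any single benchmark.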