Once Upon an Input: Reasoning via Per-Instance Program Synthesis
October 26, 2025
Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
cs.AI
Abstract
Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps, such as Chain of Thought (CoT) and Program of Thought (PoT), improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback, without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks, including all tasks of Big Bench Extra Hard (BBEH) as well as visual question answering, relational reasoning, and mathematical reasoning tasks, show that PIPS improves absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
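The per-instance dispatch described above can be pictured roughly as follows. This is a minimal sketch based only on the abstract: the confidence estimator, the structural check, the threshold, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Optional

def estimate_confidence(instance: str) -> float:
    # Stand-in scorer (assumption): PIPS derives its confidence metric from
    # the LLM itself; here we fake it with a trivial length heuristic.
    return 0.9 if len(instance.split()) < 5 else 0.2

def structural_feedback(program: str) -> Optional[str]:
    # Stand-in structural check (assumption): flag programs that do not even
    # compile. Note it needs no task-specific test cases, matching the
    # abstract's claim of feedback "without explicit test cases".
    try:
        compile(program, "<program>", "exec")
        return None
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"

def solve_instance(instance: str, threshold: float = 0.5) -> str:
    """Dispatch one instance to direct inference or program synthesis."""
    if estimate_confidence(instance) >= threshold:
        return "direct"            # confident: answer via direct inference
    program = "answer = 42"        # stand-in for an LLM-synthesized program
    feedback = structural_feedback(program)
    # A real system would loop here, asking the LLM to refine the program
    # on each round of structural feedback before executing it.
    return "program" if feedback is None else "refine"

print(solve_instance("2+2"))                                   # short, high confidence
print(solve_instance("a long multi step reasoning problem"))   # low confidence
```

The point of the sketch is the control flow, not the components: cheap direct inference when the model is confident, and a synthesize-check-refine loop driven only by structural properties of the program when it is not.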