Once Upon an Input: Reasoning via Per-Instance Program Synthesis
October 26, 2025
Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
cs.AI
Abstract
Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps, such as Chain of Thought (CoT) and Program of Thought (PoT), improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback, without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks, including all tasks of Big Bench Extra Hard (BBEH) as well as visual question answering, relational reasoning, and mathematical reasoning tasks, show that PIPS improves absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
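The per-instance dispatch described above can be pictured roughly as follows. This is a minimal sketch based only on the abstract: the confidence estimator, the structural check, the threshold, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Optional

def estimate_confidence(instance: str) -> float:
    # Stand-in scorer (assumption): PIPS derives its confidence metric from
    # the LLM itself; here we fake it with a trivial length heuristic.
    return 0.9 if len(instance.split()) < 5 else 0.2

def structural_feedback(program: str) -> Optional[str]:
    # Stand-in structural check (assumption): flag programs that do not even
    # compile. Note it needs no task-specific test cases, matching the
    # abstract's claim of feedback "without explicit test cases".
    try:
        compile(program, "<program>", "exec")
        return None
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"

def solve_instance(instance: str, threshold: float = 0.5) -> str:
    """Dispatch one instance to direct inference or program synthesis."""
    if estimate_confidence(instance) >= threshold:
        return "direct"            # confident: answer via direct inference
    program = "answer = 42"        # stand-in for an LLM-synthesized program
    feedback = structural_feedback(program)
    # A real system would loop here, asking the LLM to refine the program
    # on each round of structural feedback before executing it.
    return "program" if feedback is None else "refine"

print(solve_instance("2+2"))                                   # short, high confidence
print(solve_instance("a long multi step reasoning problem"))   # low confidence
```

The point of the sketch is the control flow, not the components: cheap direct inference when the model is confident, and a synthesize-check-refine loop driven only by structural properties of the program when it is not.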