Case2Code: Learning Inductive Reasoning with Synthetic Data

July 17, 2024

Authors: Yunfan Shao, Linyang Li, Yichuan Ma, Peiji Li, Demin Song, Qinyuan Cheng, Shimin Li, Xiaonan Li, Pengyu Wang, Qipeng Guo, Hang Yan, Xipeng Qiu, Xuanjing Huang, Dahua Lin

cs.AI

Abstract

Complex reasoning is an impressive ability of large language models (LLMs). Most LLMs are skilled at deductive reasoning, such as chain-of-thought prompting or iterative tool use, to solve challenging tasks step by step. In this paper, we focus on evaluating and teaching LLMs to conduct inductive reasoning, that is, to infer underlying rules by observing examples or sequential transformations. However, collecting large-scale and diverse human-generated inductive data is challenging. We therefore focus on data synthesis in the code domain and propose a Case2Code task that exploits the expressiveness and correctness of programs. Specifically, we collect a diverse set of executable programs, synthesize input-output transformations for each program, and require LLMs to infer the underlying code implementations from the synthetic I/O cases. We first evaluate representative LLMs on the synthesized Case2Code task and demonstrate that Case2Code induction is challenging for LLMs. Then, we synthesize large-scale Case2Code training samples to train LLMs to perform inductive reasoning. Experimental results show that such induction training not only improves in-distribution Case2Code performance but also enhances various coding abilities of the trained LLMs, demonstrating the great potential of learning inductive reasoning via synthetic data.
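
The pipeline described in the abstract (collect an executable program, synthesize input-output cases by running it, then ask a model to recover the implementation) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the helper names `synthesize_cases`, `build_prompt`, and `is_consistent`, the prompt wording, the toy `hidden_program`, and the hand-written candidate inputs all stand in for whatever the authors actually use.

```python
from typing import Any, Callable, List, Tuple

# One case pairs the positional arguments with the output observed for them.
Case = Tuple[tuple, Any]


def synthesize_cases(program: Callable[..., Any], inputs: List[tuple]) -> List[Case]:
    """Run the program on each candidate input and keep the runs that succeed."""
    cases: List[Case] = []
    for args in inputs:
        try:
            cases.append((args, program(*args)))
        except Exception:
            # Inputs that crash the program yield no usable output; skip them.
            continue
    return cases


def build_prompt(func_name: str, cases: List[Case]) -> str:
    """Format the I/O cases as an induction prompt (hypothetical wording)."""
    lines = [f"Write a Python function `{func_name}` consistent with these cases:"]
    for args, output in cases:
        arg_text = ", ".join(repr(a) for a in args)
        lines.append(f"{func_name}({arg_text}) -> {output!r}")
    return "\n".join(lines)


def is_consistent(candidate: Callable[..., Any], cases: List[Case]) -> bool:
    """Check whether a candidate implementation reproduces every observed case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Toy stand-in for a collected program: reverse the string, then repeat it k times.
def hidden_program(s: str, k: int) -> str:
    return s[::-1] * k


# Hand-written candidate inputs (an assumption); the last one is deliberately invalid.
candidate_inputs = [("ab", 2), ("xyz", 1), ("", 3), ("q", "oops")]
cases = synthesize_cases(hidden_program, candidate_inputs)
prompt = build_prompt("f", cases)
```

In this sketch, `prompt` would be sent to a model, and `is_consistent` would score the returned function against the same cases (or against held-out ones). Executing the program to obtain outputs is what makes the synthetic labels correct by construction, which is the property of programs the abstract refers to.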