On Code-Induced Reasoning in LLMs
September 25, 2025
Authors: Abdul Waheed, Zhen Wu, Carolyn Rosé, Daphne Ippolito
cs.AI
Abstract
Code data has been shown to enhance the reasoning capabilities of large
language models (LLMs), but it remains unclear which aspects of code are most
responsible. We investigate this question with a systematic, data-centric
framework. We construct parallel instruction datasets in ten programming
languages and apply controlled perturbations that selectively disrupt
structural or semantic properties of code. We then finetune LLMs from five
model families and eight scales on each variant and evaluate their performance
on natural language, math, and code tasks. Across 3,331 experiments, our
results show that LLMs are more vulnerable to structural perturbations than
semantic ones, particularly on math and code tasks. Appropriate abstractions
such as pseudocode and flowcharts can be as effective as code, and encoding the
same information in fewer tokens, without adhering to the original syntax, often
retains or even improves performance. Remarkably, even corrupted code with
misleading signals remains competitive when surface-level regularities persist.
Finally, syntactic styles also shape task-specific gains, with Python favoring
natural language reasoning and lower-level languages such as Java and Rust
favoring math. Through our systematic framework, we aim to provide insight into
how different properties of code influence reasoning and inform the design of
training data for enhancing LLM reasoning capabilities.
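
To make the structural/semantic distinction concrete, below is a minimal
Python sketch of what two such perturbations might look like on a toy snippet.
The function names, the example snippet, and the specific transforms are
illustrative assumptions, not the paper's actual perturbation suite.

    # A hypothetical sketch of the structural vs. semantic distinction
    # described above -- not the authors' actual perturbation suite. The
    # function names, transforms, and toy snippet are all invented here.
    import random

    def structural_perturbation(code: str, seed: int = 0) -> str:
        """Disrupt structure: shuffle line order, breaking control flow and
        indentation-based nesting while keeping every token intact."""
        lines = code.splitlines()
        random.Random(seed).shuffle(lines)
        return "\n".join(lines)

    def semantic_perturbation(code: str) -> str:
        """Disrupt semantics: rename an identifier to a misleading one,
        leaving layout and syntax (surface regularities) untouched."""
        return code.replace("total", "smallest")  # misleading signal

    ORIGINAL = (
        "def running_sum(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        total += x\n"
        "    return total"
    )

    print(structural_perturbation(ORIGINAL))
    print("---")
    print(semantic_perturbation(ORIGINAL))

Under this reading, the structural perturbation breaks control flow while
keeping every token, whereas the semantic one injects a misleading signal
while preserving the surface-level regularities the abstract highlights.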