On Code-Induced Reasoning in LLMs
September 25, 2025
Authors: Abdul Waheed, Zhen Wu, Carolyn Rosé, Daphne Ippolito
cs.AI
Abstract
Code data has been shown to enhance the reasoning capabilities of large
language models (LLMs), but it remains unclear which aspects of code are most
responsible. We investigate this question with a systematic, data-centric
framework. We construct parallel instruction datasets in ten programming
languages and apply controlled perturbations that selectively disrupt
structural or semantic properties of code. We then finetune LLMs from five
model families and eight scales on each variant and evaluate their performance
on natural language, math, and code tasks. Across 3,331 experiments, our
results show that LLMs are more vulnerable to structural perturbations than
semantic ones, particularly on math and code tasks. Appropriate abstractions
such as pseudocode and flowcharts can be as effective as code: encoding the
same information in fewer tokens, without adhering to the original syntax,
often retains or even improves performance. Remarkably, even corrupted code
with misleading signals remains competitive as long as surface-level
regularities persist. Finally, syntactic style also shapes task-specific
gains, with Python favoring natural-language reasoning and lower-level
languages such as Java and Rust favoring math. Through our systematic
framework, we aim to provide insight into
how different properties of code influence reasoning and inform the design of
training data for enhancing LLM reasoning capabilities.
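
To make the contrast between the two perturbation families concrete, here is a minimal sketch, assuming toy transformations of our own choosing; the snippet, function names, and specific edits are illustrative and are not taken from the paper's actual pipeline. It pairs a structural perturbation that scrambles a function body while keeping every token, with a semantic perturbation that keeps indentation and keywords intact but renames identifiers to misleading ones.

    # Illustrative sketch only: the transformations below are assumptions,
    # not the paper's actual perturbation pipeline.
    import random
    import re

    SNIPPET = """def total(prices):
        result = 0
        for p in prices:
            result += p
        return result"""

    def perturb_structure(code: str, seed: int = 0) -> str:
        """Structural perturbation: shuffle the body lines, destroying
        control-flow structure while preserving every token."""
        lines = code.splitlines()
        header, body = lines[0], lines[1:]
        random.Random(seed).shuffle(body)
        return "\n".join([header] + body)

    def perturb_semantics(code: str) -> str:
        """Semantic perturbation: rename identifiers to misleading names,
        keeping indentation, keywords, and surface shape intact."""
        renames = {"total": "sort_desc", "prices": "flag",
                   "result": "index", "p": "key"}
        pattern = re.compile(r"\b(" + "|".join(renames) + r")\b")
        return pattern.sub(lambda m: renames[m.group(1)], code)

    if __name__ == "__main__":
        print(perturb_structure(SNIPPET))
        print()
        print(perturb_semantics(SNIPPET))

On the abstract's account, training data resembling the second variant can remain competitive despite its misleading identifier names, because the surface-level regularities (indentation, keywords, overall shape) survive the corruption, whereas the first variant breaks exactly the structure the models appear most sensitive to.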