
Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

October 5, 2025
Authors: Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu
cs.AI

Abstract

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales data generation to a large number of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering the filtered outputs into natural-language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strongly competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
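
To make the closed-loop pipeline concrete, below is a minimal Python sketch of the four stages the abstract names: generate a code-format CoT, execute it, apply rule-based filters, and reverse-engineer survivors into instruction-CoT pairs. All function names here (`generate_code_cot`, `passes_rules`, `reverse_engineer`) and the specific filter rules are hypothetical stand-ins for the LLM calls and filters described in the paper, not the authors' implementation.

```python
# Minimal sketch of a Caco-style closed loop (stdlib only).
# Generator and reverse-engineering steps are placeholders for LLM calls.
import ast
import contextlib
import io


def generate_code_cot(seed: str) -> str:
    """Stand-in for the fine-tuned code-based CoT generator (an LLM in Caco).
    Here it just perturbs a seed program so the sketch stays self-contained."""
    return seed.replace("3", "7")


def execute(program: str) -> tuple[bool, str]:
    """Run a candidate code-CoT and capture its printed answer.
    A real pipeline would sandbox this (subprocess, time/memory limits)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(compile(program, "<cot>", "exec"), {})
        return True, buf.getvalue().strip()
    except Exception:
        return False, ""


def passes_rules(program: str, output: str, seen: set) -> bool:
    """Rule-based filter: ran to completion, produced a numeric answer,
    and is structurally novel (deduplicated by AST dump)."""
    if not output or not output.lstrip("-").replace(".", "", 1).isdigit():
        return False
    sig = ast.dump(ast.parse(program))
    if sig in seen:  # reject structural duplicates
        return False
    seen.add(sig)
    return True


def reverse_engineer(program: str, answer: str) -> dict:
    """Stand-in for the reverse-engineering step: in Caco an LLM rewrites the
    verified program into a natural-language instruction and a language CoT."""
    return {
        "instruction": "Solve the problem encoded by:\n" + program,
        "cot": f"Executing the steps yields {answer}.",
        "answer": answer,
    }


seed = "a = 3\nb = 4\nprint(a * b)\n"
seen, dataset = set(), []
for _ in range(2):  # Caco scales this loop to ~1.3M verified examples
    candidate = generate_code_cot(seed)
    ok, out = execute(candidate)
    if ok and passes_rules(candidate, out, seen):
        dataset.append(reverse_engineer(candidate, out))
print(f"kept {len(dataset)} verified instruction-CoT pair(s)")
```

The key design point the sketch illustrates is that execution serves as the correctness oracle: only traces that actually run and satisfy the structural filters are converted into training data, which is what gives the loop its "guaranteed executability" without human review.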