Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

October 5, 2025
Authors: Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu
cs.AI

Abstract

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large number of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
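The execution-based validation step described above can be illustrated with a minimal sketch: run each generated code-form reasoning trace in a sandboxed subprocess and keep it only if it executes cleanly and its output matches the expected answer. This is an assumption-laden illustration, not the authors' actual pipeline; the function names, the subprocess-based sandboxing, and the exact-match comparison are all choices made here for clarity.

```python
import subprocess
import sys


def execute_code_cot(code: str, timeout: float = 5.0):
    """Run a generated code-form CoT in a subprocess and capture its stdout.

    Returns the printed result as a string, or None if execution
    fails, exits non-zero, or times out.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()


def validate_trace(code: str, expected: str) -> bool:
    """Keep a trace only if executing it reproduces the expected answer."""
    result = execute_code_cot(code)
    return result is not None and result == expected


# Example: a code-form CoT for "What is 12 * 7 + 5?" with expected answer 89.
trace = "print(12 * 7 + 5)"
print(validate_trace(trace, "89"))  # True: execution matches the answer
```

In a full pipeline this executability filter would be combined with rule-based checks (e.g. on trace length or structural diversity) before the surviving traces are reverse-engineered into natural language instructions and CoTs.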