The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
May 24, 2025
Authors: Ruichen Zhang, Rana Muhammad Shahroz Khan, Zhen Tan, Dawei Li, Song Wang, Tianlong Chen
cs.AI
Abstract
Data-centric distillation, including data augmentation, selection, and
mixing, offers a promising path to creating smaller, more efficient student
Large Language Models (LLMs) that retain strong reasoning abilities. However,
a comprehensive benchmark that systematically assesses the effect of each
distillation approach is still lacking. This paper introduces DC-CoT, the first
data-centric benchmark that investigates data manipulation in chain-of-thought
(CoT) distillation from the method, model, and data perspectives. Utilizing various
teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student
architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of
these data manipulations on student model performance across multiple reasoning
datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD)
generalization, and cross-domain transfer. Our findings aim to provide
actionable insights and establish best practices for optimizing CoT
distillation through data-centric techniques, ultimately facilitating the
development of more accessible and capable reasoning models. The dataset can be
found at https://huggingface.co/datasets/rana-shahroz/DC-COT, and our code is
shared at https://anonymous.4open.science/r/DC-COT-FF4C/.