The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

Abstract

Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark for systematically assessing the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from the method, model, and data perspectives. Using various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B and 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) generalization, out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset is available at https://huggingface.co/datasets/rana-shahroz/DC-COT, and the code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.
