

Not All Correct Answers Are Equal: Why Your Distillation Source Matters

May 20, 2025
作者: Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li
cs.AI

Abstract

Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
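
The abstract's distribution analysis (token-length diversity and perplexity of the distilled traces) can be reproduced in spirit with a short script over the released datasets. The sketch below is illustrative only, not the authors' code: the column name `response`, the small scoring model, and the 100-example streaming sample are assumptions made for the example, whereas the paper's analysis covers the full 1.89M-query corpus.

```python
# Minimal sketch: token-length and perplexity statistics over a distilled
# dataset sample. Field names and the scoring model are assumptions.
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

DATASET = "a-m-team/AM-Thinking-v1-Distilled"  # released dataset (see links above)
SCORER = "Qwen/Qwen2.5-0.5B"                   # assumed small LM used only for perplexity scoring

tokenizer = AutoTokenizer.from_pretrained(SCORER)
model = AutoModelForCausalLM.from_pretrained(SCORER, torch_dtype=torch.bfloat16)
model.eval()

# Stream a small sample instead of downloading the whole corpus.
sample = load_dataset(DATASET, split="train", streaming=True).take(100)

lengths, ppls = [], []
for example in sample:
    text = example["response"]  # assumed field holding the distilled reasoning trace
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).input_ids
    lengths.append(ids.shape[1])
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    ppls.append(math.exp(loss.item()))

print(f"mean token length: {sum(lengths) / len(lengths):.1f}")
print(f"mean perplexity:   {sum(ppls) / len(ppls):.2f}")
```

Running the same script against the AM-Qwen3-Distilled dataset (and, where available, DeepSeek-R1-distilled data) would give the side-by-side comparison the paper describes; a longer-tailed length distribution and lower mean perplexity are the signals attributed to the AM-Thinking-v1 source.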

