Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Abstract

Distillation has emerged as a practical and effective approach to enhancing the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study of reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, finding that the AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open, high-performing, reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
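
The two dataset statistics named in the abstract, token length diversity and perplexity, can be illustrated with a minimal sketch. The abstract does not say how either quantity was computed, so everything below is an assumption: length diversity is summarized by the population standard deviation of response lengths, perplexity is taken as the exponential of the negative mean per-token log-probability, and the function names and toy numbers are hypothetical rather than drawn from the released datasets.

```python
import math
import statistics


def length_spread(token_counts):
    """Population standard deviation of response lengths.

    Assumed proxy for "token length diversity"; the paper may use another measure.
    """
    return statistics.pstdev(token_counts)


def perplexity(token_logprobs):
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy, made-up inputs for illustration only (not real dataset statistics).
lengths_a = [100, 2000, 8000]     # widely varying response lengths
lengths_b = [3000, 3400, 3700]    # tightly clustered response lengths
print(length_spread(lengths_a) > length_spread(lengths_b))

# A sequence in which every token has probability 0.25.
uniform_logprobs = [math.log(0.25)] * 4
print(perplexity(uniform_logprobs))
```

Under these assumptions, a dataset whose responses range from very short to very long yields a larger spread, and a lower perplexity means the scoring model finds the text more predictable.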
