
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

March 1, 2026
作者: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
cs.AI

Abstract

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
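The cross-validation step described above — strong reasoning models independently re-solving each synthetic problem, with agreement used to confirm both problem validity and answer correctness — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline; the `Item` structure, the `verifiers` callables, and the exact-match agreement rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """A synthetic reasoning sample: a problem plus a reference answer."""
    problem: str
    answer: str


def cross_validate(items: List[Item],
                   verifiers: List[Callable[[str], str]]) -> List[Item]:
    """Keep an item only if every verifier model independently reproduces
    the reference answer; agreement is treated as evidence that the problem
    is well-posed and the answer correct (hypothetical filtering rule)."""
    kept = []
    for item in items:
        answers = [solve(item.problem) for solve in verifiers]
        if all(a.strip() == item.answer.strip() for a in answers):
            kept.append(item)
    return kept


# Toy stand-ins for strong reasoning models (real pipelines would call LLMs).
solver_a = lambda p: "4" if "2+2" in p else "unknown"
solver_b = lambda p: "4" if "2+2" in p else "?"

data = [Item("What is 2+2?", "4"),
        Item("An ill-posed question", "42")]
filtered = cross_validate(data, [solver_a, solver_b])
# Only the well-posed item survives: the verifiers disagree with the
# reference answer on the second item, so it is dropped.
```

In a real deployment the verifiers would be independent frontier reasoning models, and answer comparison would likely use normalized or semantic matching rather than exact string equality.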
PDF · March 4, 2026