
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

March 1, 2026
作者: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
cs.AI

Abstract

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
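The cross-validation step described above — strong reasoning models independently re-solving each synthetic problem, with agreement used to confirm both problem validity and answer correctness — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline; the `Item` structure, the `verifiers` callables, and the exact-match agreement rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """A synthetic reasoning sample: a problem plus a reference answer."""
    problem: str
    answer: str


def cross_validate(items: List[Item],
                   verifiers: List[Callable[[str], str]]) -> List[Item]:
    """Keep an item only if every verifier model independently reproduces
    the reference answer; agreement is treated as evidence that the problem
    is well-posed and the answer correct (hypothetical filtering rule)."""
    kept = []
    for item in items:
        answers = [solve(item.problem) for solve in verifiers]
        if all(a.strip() == item.answer.strip() for a in answers):
            kept.append(item)
    return kept


# Toy stand-ins for strong reasoning models (real pipelines would call LLMs).
solver_a = lambda p: "4" if "2+2" in p else "unknown"
solver_b = lambda p: "4" if "2+2" in p else "?"

data = [Item("What is 2+2?", "4"),
        Item("An ill-posed question", "42")]
filtered = cross_validate(data, [solver_a, solver_b])
# Only the well-posed item survives: the verifiers disagree with the
# reference answer on the second item, so it is dropped.
```

In a real deployment the verifiers would be independent frontier reasoning models, and answer comparison would likely use normalized or semantic matching rather than exact string equality.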
PDF · March 4, 2026