CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
March 1, 2026
Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
cs.AI
Abstract
Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
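The cross-validation step in the automated evaluation pipeline can be illustrated with a minimal sketch. This is not the authors' code; the function names, the string-normalization step, and the toy solver stand-ins are all illustrative assumptions. The core idea from the abstract is that a synthesized problem is retained only when independent strong reasoning models agree on its reference answer, which validates both the problem and the answer without human annotation.

```python
# Illustrative sketch (not the authors' implementation) of answer
# cross-validation: keep a synthesized sample only if every independent
# solver model reproduces the reference answer.

def cross_validate(problem, reference_answer, solvers, normalize=str.strip):
    """Return True if all solvers agree with the reference answer
    after normalization; such samples pass the automated filter."""
    reference = normalize(reference_answer)
    answers = [normalize(solve(problem)) for solve in solvers]
    return all(answer == reference for answer in answers)

# Toy stand-ins for frontier reasoning models (hypothetical):
solver_a = lambda problem: "42"
solver_b = lambda problem: " 42 "  # same answer, different formatting

kept = cross_validate("What is 6 * 7?", "42", [solver_a, solver_b])
print(kept)  # True: the sample survives the validity/correctness check
```

In practice the solvers would be strong reasoning models queried with the synthesized problem, and disagreement would flag either an invalid problem or an incorrect reference answer for removal.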