CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Abstract

Large language models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by post-training with supervised fine-tuning (SFT) and reinforcement learning (RL) on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
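The cross-validation step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' pipeline: the `Sample` and `Verdict` types, the `Verifier` callable (standing in for a call to a strong reasoning model), and the unanimous-agreement rule are all assumptions made for illustration, since the abstract does not specify how the verifier outputs are combined.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical record layout; field names are illustrative only."""
    subject: str   # broad scientific discipline
    topic: str     # fine-grained topic from the taxonomy
    problem: str
    answer: str
    cot: str       # long chain-of-thought trajectory


@dataclass(frozen=True)
class Verdict:
    problem_valid: bool    # is the problem well-posed?
    answer_correct: bool   # does the reference answer hold up?


# Assumption: each verifier wraps one strong reasoning model.
Verifier = Callable[[Sample], Verdict]


def cross_validate(samples: Iterable[Sample],
                   verifiers: List[Verifier]) -> List[Sample]:
    """Keep a sample only if every verifier accepts both the problem
    and its answer (unanimity is an assumed rule, not the paper's)."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    kept = []
    for sample in samples:
        verdicts = [verify(sample) for verify in verifiers]
        if all(v.problem_valid and v.answer_correct for v in verdicts):
            kept.append(sample)
    return kept
```

In practice each verifier would prompt a model and parse its judgment; plugging in simple stub functions is enough to exercise the filtering logic.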
