CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Abstract

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by post-training with supervised fine-tuning (SFT) and reinforcement learning (RL) on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with the detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets concentrate on mathematics and rarely extend to broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.

