Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Abstract

Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
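The reward signal in the agent-environment loop can be illustrated with a minimal Python sketch. It is not the Loong codebase or its API: the names `Triple`, `run_code`, `extract_final_answer`, and `reward` are hypothetical, the "Answer: <value>" convention for the last line of a CoT solution is an assumption, and exact string matching stands in for whatever verifier the framework actually uses.

```python
import contextlib
import io
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical container for one question-answer-code triple."""
    question: str
    answer: str
    code: str


def run_code(code: str) -> str:
    """Execute the paired code and return what it printed, stripped.

    Assumption: the code prints its final answer to stdout. A real system
    would run untrusted code in a sandbox rather than with a bare exec().
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {"__name__": "__main__"})
    return buffer.getvalue().strip()


def extract_final_answer(cot: str) -> str:
    """Return the value on the last 'Answer: ...' line (assumed convention)."""
    for line in reversed(cot.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return ""


def reward(cot: str, triple: Triple) -> float:
    """Give 1.0 when the CoT's final answer equals the code-executed answer."""
    reference = run_code(triple.code)
    if reference and extract_final_answer(cot) == reference:
        return 1.0
    return 0.0


# Illustrative data only, not taken from LoongBench.
triple = Triple(question="What is 17 * 23?", answer="391", code="print(17 * 23)")
good_cot = "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51.\nAnswer: 391"
bad_cot = "17 * 23 = 340 + 41.\nAnswer: 381"

print(reward(good_cot, triple))  # 1.0
print(reward(bad_cot, triple))   # 0.0
```

Comparing against the executed code rather than the stored answer string means the same check also applies to newly synthesized triples that have not been human-vetted.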
