Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Abstract

Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
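
The reward rule described in the abstract, where a CoT solution is rewarded when its answer agrees with the code-executed answer, can be illustrated with a minimal sketch. This is not the Loong implementation: the function names, the use of `\boxed{...}` to mark the final answer, the exact string comparison, and the unsandboxed `exec` call are all assumptions made for illustration only.

```python
import contextlib
import io
import re
from typing import Optional


def run_code(code: str) -> str:
    """Execute the reference code and return what it prints, stripped."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Assumption: the code is trusted. A real system would sandbox this.
        exec(code, {})
    return buffer.getvalue().strip()


def extract_final_answer(cot: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the CoT, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", cot)
    return matches[-1].strip() if matches else None


def verifiable_reward(cot: str, code: str) -> float:
    """Give reward 1.0 when the CoT answer equals the code-executed answer."""
    expected = run_code(code)
    predicted = extract_final_answer(cot)
    return 1.0 if predicted is not None and predicted == expected else 0.0


# Hypothetical question-answer-code example: sum of the integers 1..10.
code = "print(sum(range(1, 11)))"
good_cot = "There are 5 pairs that each sum to 11, so the total is \\boxed{55}."
bad_cot = "Adding the numbers quickly gives \\boxed{54}."

print(verifiable_reward(good_cot, code))
print(verifiable_reward(bad_cot, code))
```

Exact string matching is the simplest possible verifier. Answers that are numerically or symbolically equivalent but formatted differently would need a more tolerant comparison.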
