

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

September 3, 2025
作者: Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao, Ruohao Guo, Yuan He, Zhuangzhuang He, Xianglong Hu, Neil Johnson, Bowen Li, Fangru Lin, Siyu Lin, Tong Liu, Yunpu Ma, Hao Shen, Hao Sun, Beibei Wang, Fangyijie Wang, Hao Wang, Haoran Wang, Yang Wang, Yifeng Wang, Zhaowei Wang, Ziyang Wang, Yifan Wu, Zikai Xiao, Chengxing Xie, Fan Yang, Junxiao Yang, Qianshuo Ye, Ziyu Ye, Guangtao Zeng, Yuwen Ebony Zhang, Zeyu Zhang, Zihao Zhu, Bernard Ghanem, Philip Torr, Guohao Li
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
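The agent-environment loop described above hinges on a simple verifiable-reward check: the agent's Chain-of-Thought answer is compared against the answer produced by executing the example's paired code. The sketch below illustrates that idea in minimal form; the function names, the `answer` variable convention, and the `Answer:` marker are illustrative assumptions, not the actual Loong API.

```python
# Minimal sketch of a verifier-based reward, assuming each seed example
# pairs a question with executable reference code (as in LoongBench).

def execute_reference_code(code: str) -> str:
    """Run the example's reference code and capture its final answer.

    Assumes the code assigns its result to a variable named `answer`.
    Real systems would sandbox this execution rather than use bare exec().
    """
    namespace: dict = {}
    exec(code, namespace)
    return str(namespace["answer"])

def extract_final_answer(cot_solution: str) -> str:
    """Take the text after the last 'Answer:' marker as the model's answer."""
    return cot_solution.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(cot_solution: str, reference_code: str) -> float:
    """Reward 1.0 when the CoT's final answer matches the code-executed one."""
    agent_answer = extract_final_answer(cot_solution)
    reference_answer = execute_reference_code(reference_code)
    return float(agent_answer == reference_answer)

# Toy usage: a correct CoT earns reward 1.0, an incorrect one earns 0.0.
code = "answer = 6 * 7"
print(verifiable_reward("6 times 7 is 42. Answer: 42", code))
```

In an RLVR setup, this binary signal would feed a policy-gradient update for the LLM agent; richer comparators (numeric tolerance, symbolic equality) are needed for domains like chemistry or advanced mathematics.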