

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

September 3, 2025
Authors: Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao, Ruohao Guo, Yuan He, Zhuangzhuang He, Xianglong Hu, Neil Johnson, Bowen Li, Fangru Lin, Siyu Lin, Tong Liu, Yunpu Ma, Hao Shen, Hao Sun, Beibei Wang, Fangyijie Wang, Hao Wang, Haoran Wang, Yang Wang, Yifeng Wang, Zhaowei Wang, Ziyang Wang, Yifan Wu, Zikai Xiao, Chengxing Xie, Fan Yang, Junxiao Yang, Qianshuo Ye, Ziyu Ye, Guangtao Zeng, Yuwen Ebony Zhang, Zeyu Zhang, Zihao Zhu, Bernard Ghanem, Philip Torr, Guohao Li
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
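The agent-environment loop described above hinges on a simple verification step: the seed example's executable code produces a ground-truth answer, and the LLM agent is rewarded only when its Chain-of-Thought's final answer agrees with that code-executed result. A minimal sketch of such a verifier is shown below; the function names and reward convention are illustrative assumptions, not the actual Loong API.

```python
# Illustrative sketch of verifier-based reward assignment, as described in
# the abstract. Names (execute_reference_code, verify_answer) are
# hypothetical, not taken from the camel-ai/loong codebase.

import math


def execute_reference_code(code: str) -> str:
    """Run a seed example's executable code and read out its `answer` variable."""
    namespace: dict = {}
    exec(code, namespace)  # seed code is human-vetted, hence trusted here
    return str(namespace["answer"])


def verify_answer(model_answer: str, reference_code: str, tol: float = 1e-6) -> float:
    """Return a binary reward: 1.0 if the model's final answer matches the
    code-executed ground truth, else 0.0."""
    ground_truth = execute_reference_code(reference_code)
    try:
        # Numeric answers are compared within a relative tolerance.
        ok = math.isclose(float(model_answer), float(ground_truth), rel_tol=tol)
    except ValueError:
        # Fall back to exact string comparison for non-numeric answers.
        ok = model_answer.strip() == ground_truth.strip()
    return 1.0 if ok else 0.0


# Example seed triple: the question asks for the sum of squares 1..10,
# and the paired code computes the reference answer.
seed_code = "answer = sum(i * i for i in range(1, 11))"
reward = verify_answer("385", seed_code)
```

In an RLVR setup, this reward would be fed back to the policy, so the agent is optimized to produce CoT traces whose final answers survive code execution, without any per-example human grading.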