Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
May 26, 2025
Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
cs.AI
Abstract
Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at
advanced reasoning tasks like math and coding via Reinforcement Learning with
Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans
without domain knowledge. We introduce Enigmata, the first comprehensive suite
tailored to improving the puzzle-reasoning skills of LLMs. It comprises 36 tasks
across seven categories, each with 1) a generator that produces unlimited
examples with controllable difficulty and 2) a rule-based verifier for
automatic evaluation. This generator-verifier design supports scalable,
multi-task RL training, fine-grained analysis, and seamless RLVR integration.
We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized
multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata,
consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such
as Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes
well to out-of-domain puzzle benchmarks and mathematical reasoning, with little
multi-task trade-off. When used to train larger models such as Seed1.5-Thinking
(20B activated parameters, 200B total), puzzle data from Enigmata
further boosts SoTA performance on advanced math and STEM reasoning tasks such
as AIME (2024-2025), BeyondAIME, and GPQA (Diamond), demonstrating the strong
generalization benefits of Enigmata. This work offers a unified, controllable
framework for advancing logical reasoning in LLMs. Resources for this work are
available at
https://seed-enigmata.github.io.
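To make the generator-verifier contract described above concrete, here is a minimal Python sketch of what one such task pair could look like. It is illustrative only: the subset-sum puzzle, the generate/verify names, the answer format, and the difficulty mapping are assumptions for this sketch, not Enigmata's actual API or one of its 36 tasks.

```python
import random
from collections import Counter

def generate(difficulty: int, seed: int | None = None) -> dict:
    """Sample one subset-sum puzzle; `difficulty` scales the list length.

    Because instances are randomly sampled, the generator can emit an
    unlimited stream of fresh examples at any requested difficulty.
    """
    rng = random.Random(seed)
    n = 4 + 2 * difficulty                          # assumed difficulty mapping
    numbers = [rng.randint(1, 50) for _ in range(n)]
    subset_size = rng.randint(2, n - 1)
    target = sum(rng.sample(numbers, subset_size))  # planted subset => solvable
    prompt = (
        f"From the numbers {numbers}, choose a subset that sums to {target}. "
        "Answer with the chosen numbers, comma-separated."
    )
    return {"numbers": numbers, "target": target, "prompt": prompt}

def verify(instance: dict, answer: str) -> float:
    """Rule-based check; the 0/1 result can serve directly as an RLVR reward."""
    try:
        picked = [int(tok) for tok in answer.split(",")]
    except ValueError:
        return 0.0  # unparseable answer
    # Reject answers that use numbers (or multiplicities) not in the instance.
    if not picked or Counter(picked) - Counter(instance["numbers"]):
        return 0.0
    return 1.0 if sum(picked) == instance["target"] else 0.0

# Scoring a batch of model completions as binary rewards:
batch = [generate(difficulty=2, seed=i) for i in range(4)]
completions = ["12, 30", "7, 7, 9", "not sure", "5"]  # placeholder outputs
rewards = [verify(inst, out) for inst, out in zip(batch, completions)]
```

The useful property this interface captures is that the verifier's binary output can be consumed directly as the reward in an RLVR loop, with no learned reward model, while the seeded generator supplies unlimited fresh instances whose difficulty can be raised over training. How Enigmata actually schedules difficulty and mixes its tasks during multi-task RL is not specified in the abstract.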