Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
May 26, 2025
Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
cs.AI
Abstract
Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at
advanced reasoning tasks like math and coding via Reinforcement Learning with
Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans
without domain knowledge. We introduce Enigmata, the first comprehensive suite
tailored to improving the puzzle reasoning skills of LLMs. It includes 36 tasks
across seven categories, each with 1) a generator that produces unlimited
examples with controllable difficulty and 2) a rule-based verifier for
automatic evaluation. This generator-verifier design supports scalable,
multi-task RL training, fine-grained analysis, and seamless RLVR integration.
We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized
multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata,
consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks
such as Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes
well to out-of-domain puzzle benchmarks and mathematical reasoning, with little
multi-tasking trade-off. When used to train larger models such as Seed1.5-Thinking
(20B activated parameters and 200B total parameters), puzzle data from Enigmata
further boosts SoTA performance on advanced math and STEM reasoning tasks such
as AIME (2024-2025), BeyondAIME, and GPQA (Diamond), demonstrating the
generalization benefits of Enigmata. This work offers a unified, controllable framework for
advancing logical reasoning in LLMs. Resources for this work are available at
https://seed-enigmata.github.io.
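
The generator-verifier pairing described in the abstract can be illustrated with a toy puzzle. The sketch below is not taken from Enigmata: the task itself, the `Puzzle` class, and the `generate` and `verify` functions are hypothetical names chosen for illustration. It assumes that difficulty is controlled by the length of the number list and that the verifier emits a binary reward, which is one common way to set up RLVR.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Puzzle:
    numbers: tuple[int, ...]
    target: int
    prompt: str


def generate(difficulty: int, seed: int) -> Puzzle:
    """Hypothetical generator: a higher difficulty yields a longer list."""
    rng = random.Random(seed)  # seeded, so every instance is reproducible
    size = 3 + 2 * difficulty
    numbers = tuple(rng.sample(range(1, 10 * size), size))  # distinct values
    a, b = rng.sample(numbers, 2)  # plant a pair so a solution always exists
    target = a + b
    prompt = (
        f"Pick two different numbers from {list(numbers)} "
        f"that sum to {target}."
    )
    return Puzzle(numbers, target, prompt)


def verify(puzzle: Puzzle, answer: str) -> float:
    """Hypothetical rule-based verifier: returns a reward of 1.0 or 0.0."""
    try:
        x, y = (int(tok) for tok in answer.replace(",", " ").split())
    except ValueError:
        return 0.0  # malformed answer: not exactly two integers
    valid = (
        x != y
        and x in puzzle.numbers
        and y in puzzle.numbers
        and x + y == puzzle.target
    )
    return 1.0 if valid else 0.0


if __name__ == "__main__":
    puzzle = generate(difficulty=2, seed=0)
    print(puzzle.prompt)
    # Brute-force one valid answer to show the verifier accepting it.
    first = next(
        n for n in puzzle.numbers
        if puzzle.target - n in puzzle.numbers and puzzle.target - n != n
    )
    print(verify(puzzle, f"{first} {puzzle.target - first}"))
    print(verify(puzzle, "no idea"))
```

Because the verifier checks the puzzle's rules instead of comparing against a single reference string, any valid pair earns the reward. The seeded generator can also produce as many fresh instances as training needs at any chosen difficulty.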