VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

June 15, 2026
Authors: Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang
cs.AI

Abstract

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling) and an 80.2 Pass@1 on LiveCodeBench v6, and it exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.