VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building on the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline comprising curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations show that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it scores 94.3 on AIME26 (97.1 with claim-level test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6, and it exhibits strong out-of-distribution generalization, with a 96.1% acceptance rate on recent, unseen LeetCode contests. These results place it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, whereas open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes but a complementary path toward frontier-level performance in parameter-dense capability regimes.
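For readers unfamiliar with the Pass@1 metric cited above, the following is a minimal sketch of how such a score is commonly computed. It assumes the widely used unbiased pass@k estimator over n sampled solutions per problem; the abstract does not state the exact evaluation protocol, so the function names (`pass_at_k`, `mean_pass_at_k`) and the toy counts are illustrative assumptions, not the report's method or data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that pass all tests,
    k: budget of attempts being scored. (Assumed protocol, see lead-in.)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples contain a success.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (samples, correct) counts per problem; NOT data from the report.
toy_results = [(8, 8), (8, 6), (8, 0), (8, 2)]
score = mean_pass_at_k(toy_results, k=1)
print(f"Pass@1 = {100 * score:.1f}")  # prints "Pass@1 = 50.0"
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems, so a reported 80.2 Pass@1 would correspond to an average single-sample success rate of about 80.2% under this assumed protocol.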